This geospatial statistical model uses routinely collected malaria case data, population data and remotely sensed data, such as open and vegetated water bodies, to estimate population living around open water bodies, expected malaria cases, and standardised morbidity ratio (SMR) of malaria. And ultimately, quantify the association between proximity to larval habitat and malaria risk in health facility catchment areas in Kasungu. The SMR compares the risk of morbidity in a population of interest with that of a standard population. In this case, our interest is to find out whether the number of dry season malaria cases in each catchment area are greater than we would expect given the malaria rate for the entire Kasungu district.

We do this by comparing what we observe (O) with what we would expect (E) if the risk of malaria was equal throughout Kasungu. The SMR statistical notation of catchment i can be written as follows: \[SMR_i = \frac{O_i}{E_i}\]

Buffers around waterbodies are created and then combined with population data in raster format to estimate the proprtion of catcment population living within 1km, 2km and 3km of water bodies. Subsequently, the observed malaria cases are modeled using Poisson regression to find out if living within various distances from water bodies is causing variability in malaria risk in Kasungu district. We hypothesize that the risk of being a case in a catchment is dependent on proximity to water bodies. The data used spans from 2017 to 2020 and was derived from digitized DHIS2 malaria records, accessibility mapping, aggregated population geospatial layer and TropWet tool in Google Earth Engine.

1 Load packages

Loading the R packages that will be used to read in, view, transform and model the malaria cases and spatial datasets.

library(SpatialEpi)
library(spdep)
library(spaMM)
library(popEpi)
library(Epi)
library(epitools)      # compute confidence intervals of malaria data
library(tidyverse)
library(caret)         # easily compute cross-validation methods to test model perfomance
library(stats)
library(performance)   # check linear model performance
library(see)           # to plot model assumptions
library(caTools)       # splitting data into training and test data sets
library(qqplotr)
library(automap)       # auto krigging
library(DescTools)     # tools for descriptive statistics e.g., pseudo R-squared
library(gtsummary)     # easily display regression model outputs, such as P value, CI
library(ggpubr)
library(plotly)
library(lubridate)
library(knitr)
library(raster)
library(rgdal)
library(rgeos)
library(sf)
library(sp)
library(tmap)
library(spdep)
library(maptools)
library(gridExtra)
library(ggsci)
library(grid)
library(exactextractr)
library(DataExplorer)
library(mapview)
library(kableExtra)    # create interactive tables
library(gt)            # create beautiful HTML tables
`%>%` <- magrittr::`%>%`

1.1 Tell R where the data is

file.path(getwd(),"data")
## [1] "C:/Users/cnkolokosa/Documents/R/upscaled_2021_updated_May/upscaled_2021/data"
here::here()
## [1] "C:/Users/cnkolokosa/Documents/R/upscaled_2021_updated_May/upscaled_2021"

2 Load datasets

The total dry season malaria cases recorded at health-care facilities in Kasungu from 2017 to 2019 are contained in the KasunguData.csv sourced from https://dhis2.health.gov.mw/. The kasungu_facility_catchments_2004.shp shapefile also contains the population and health information within each health-facility catchment area in Kasungu district.

The aggregated population raster layers for Malawi e.g.,ku_pop_2017_1km_aggregated.tif were downloaded from the Open Spatial and Demographic and Data Research website: https://www.worldpop.org/geodata/country?iso3=MWI. These layers estimate total number of people per grid-cell. The units are number of people per pixel with country totals adjusted to match the corresponding official United Nations population estimates. The datasets were downloaded in Geotiff at a resolution of 1km and are projected in Geographic Coordinate System, WGS84.

The kasungu_water.shpand water_bodies layers contain open and vegetated waterbodies polygons, detected using the Tropical Wetland Unmixing Tool (TropWet). TropWet is a Google Earth Engine hosted toolbox that uses the Landsat archive to map tropical wetlands and can be accessed through: https://www.aber.ac.uk/en/dges/research/earth-observation-laboratory/research/tropwet/

# Kasungu dry season malaria data
dry_season_malaria_2017_2020 <- read.csv(here::here("data/dry_season_malaria_2017_2020.csv"),
                                         stringsAsFactors = FALSE)

# Kasungu monthly NMCP confirmed malaria cases
monthly_malaria_2017_2021 <- read.csv(here::here("data/Kasungu monthly malaria 2017-2021.csv"),
                                      stringsAsFactors = FALSE)

monthly_malaria_2017_2021$date <- lubridate::ym(monthly_malaria_2017_2021$periodid) 

# Kasungu district boundary shapefile 
kasungu_district <- sf::st_read(here::here("data", "kasungu_district.shp"))
## Reading layer `kasungu_district' from data source 
##   `C:\Users\cnkolokosa\Documents\R\upscaled_2021_updated_May\upscaled_2021\data\kasungu_district.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 5 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 491272.7 ymin: 8494349 xmax: 609044.2 ymax: 8632164
## Projected CRS: WGS 84 / UTM zone 36S
# Kasungu health facility catchments generated from accessibility mapping
malire_new <- sf::st_read(here::here("data", "zipatala_catchment_areas.shp")) |> 
              sf::st_transform(32736) # reproject to WGS UTM Zone 36 South
## Reading layer `zipatala_catchment_areas' from data source 
##   `C:\Users\cnkolokosa\Documents\R\upscaled_2021_updated_May\upscaled_2021\data\zipatala_catchment_areas.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 26 features and 2 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 32.925 ymin: -13.61667 xmax: 34.00833 ymax: -12.375
## Geodetic CRS:  WGS 84
# Kasungu population raster layer
kasungu_population_2017 <- raster(here::here("data", "ku_pop_2017_1km_aggregated.tif"))

kasungu_population_2018 <- raster(here::here("data", "ku_pop_2018_1km_aggregated.tif"))

kasungu_population_2019 <- raster(here::here("data", "ku_pop_2019_1km_aggregated.tif"))

kasungu_population_2020 <- raster(here::here("data", "ku_pop_2020_1km_aggregated.tif"))

# Read in waterbodies polygons 
dryseason_waterbodies_2017 <- sf::st_read(here::here("data", "water_bodies_2017.shp"))
## Reading layer `water_bodies_2017' from data source 
##   `C:\Users\cnkolokosa\Documents\R\upscaled_2021_updated_May\upscaled_2021\data\water_bodies_2017.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 168 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 514497 ymin: 8495941 xmax: 603149.8 ymax: 8620169
## Projected CRS: WGS 84 / UTM zone 36S
dryseason_waterbodies_2018 <- sf::st_read(here::here("data", "kasungu_2018_water.shp"))
## Reading layer `kasungu_2018_water' from data source 
##   `C:\Users\cnkolokosa\Documents\R\upscaled_2021_updated_May\upscaled_2021\data\kasungu_2018_water.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1105 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 496807.6 ymin: 8494693 xmax: 607913.8 ymax: 8607747
## Projected CRS: WGS 84 / UTM zone 36S
dryseason_waterbodies_2019 <- sf::st_read(here::here("data", "kasungu_2019_water.shp"))
## Reading layer `kasungu_2019_water' from data source 
##   `C:\Users\cnkolokosa\Documents\R\upscaled_2021_updated_May\upscaled_2021\data\kasungu_2019_water.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1941 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 494197.2 ymin: 8494693 xmax: 607913.8 ymax: 8617573
## Projected CRS: WGS 84 / UTM zone 36S
dryseason_waterbodies_2020 <- sf::st_read(here::here("data", "water_bodies_2020.shp"))
## Reading layer `water_bodies_2020' from data source 
##   `C:\Users\cnkolokosa\Documents\R\upscaled_2021_updated_May\upscaled_2021\data\water_bodies_2020.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 266 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 508985.6 ymin: 8495793 xmax: 585761.1 ymax: 8620169
## Projected CRS: WGS 84 / UTM zone 36S
# Add a row ID to water bodies polygons 
dryseason_waterbodies_2017$ID <- 1:nrow(dryseason_waterbodies_2017)

dryseason_waterbodies_2018$ID <- 1:nrow(dryseason_waterbodies_2018)

dryseason_waterbodies_2019$ID <- 1:nrow(dryseason_waterbodies_2019)

dryseason_waterbodies_2020$ID <- 1:nrow(dryseason_waterbodies_2020)

2.1 View the dry season malaria case data

Lets have a closer look at the malaria dataset. We observe that Kasungu district has 30 health facilities classified as dispensary, health centre, district hospital and rural hospital, and the highest malaria cases were recorded at Kasungu District Hospital.

dry_season_malaria_2017_2020 |>
  summary()
##        X             rowID          Names              dr_2017       
##  Min.   : 1.00   Min.   : 1.00   Length:36          Min.   :    0.0  
##  1st Qu.: 9.75   1st Qu.: 9.75   Class :character   1st Qu.:  918.8  
##  Median :18.50   Median :18.50   Mode  :character   Median : 1505.0  
##  Mean   :18.50   Mean   :18.50                      Mean   : 1818.1  
##  3rd Qu.:27.25   3rd Qu.:27.25                      3rd Qu.: 2150.0  
##  Max.   :36.00   Max.   :36.00                      Max.   :11976.0  
##                                                                      
##     dr_2018        dr_2019         dr_2020         LONGITU     
##  Min.   :   0   Min.   :    0   Min.   :    0   Min.   :33.18  
##  1st Qu.:1180   1st Qu.: 1152   1st Qu.: 1650   1st Qu.:33.38  
##  Median :1706   Median : 1569   Median : 2698   Median :33.50  
##  Mean   :2093   Mean   : 2039   Mean   : 3404   Mean   :33.52  
##  3rd Qu.:2754   3rd Qu.: 2512   3rd Qu.: 4722   3rd Qu.:33.68  
##  Max.   :9820   Max.   :11399   Max.   :16435   Max.   :33.87  
##                                                 NA's   :6      
##     LATITUD      
##  Min.   :-13.57  
##  1st Qu.:-13.25  
##  Median :-12.98  
##  Mean   :-12.99  
##  3rd Qu.:-12.79  
##  Max.   :-12.42  
##  NA's   :6
# Plotly bar chart -------------------------------------------------------------
bar_chart <- dry_season_malaria_2017_2020 |>  
  dplyr::filter(Names != "K2 Taso Clinic",          # Have missing malaria records
                Names != "Kalikeni Private Clinic",
                Names != "Kakwale Health Centre",
                Names != "St Andrews Community Hospital",
                Names != "St. Faith Health Centre",
                Names != "Chambwe Health Centre") |> 
  plotly::plot_ly(y = ~Names,
                  x = ~dr_2017,
                  type = "bar",
                  orientation = 'h',
                  name = "2017") |>
  plotly::add_trace(x = ~ dr_2018,
                    name = "2018") |>
  plotly::add_trace(x = ~ dr_2019,
                    name = "2019") |>
  plotly::add_trace(x = ~ dr_2020,
                    name = "2020") |> 
  plotly::layout(xaxis = list(title = "Total malaria cases"),
                 yaxis = list(title = " "),
                 hovermode = "compare",
                 margin = list(b = 10,
                               t = 10,
                               pad = 2))
bar_chart

Fig.1 The total malaria cases recorded at each health-care facility in Kasungu district

# Pivot longer -----------------------------------------------------------------
# dry_season_malaria_longer <- dry_season_malaria_2017_2020 |>
#   dplyr::filter(Names != "K2 Taso Clinic",                       
#                 Names != "Kalikeni Private Clinic",
#                 Names != "Kakwale Health Centre",
#                 Names != "St Andrews Community Hospital",
#                 Names != "St. Faith Health Centre",
#                 Names != "Chambwe Health Centre") |>
#   dplyr::rename(`2017` = dr_2017,
#                 `2018` = dr_2018,
#                 `2019` = dr_2019,
#                 `2020` = dr_2020) |>
#   tidyr::pivot_longer(cols = `2017`:`2020`,
#                       names_to = 'year',
#                       values_to = 'malaria_cases')
# 
# ggplot2::ggplot(dry_season_malaria_longer, 
#                 aes(x = malaria_cases, 
#                     y = Names,
#                     fill = year))+
#   ggplot2::geom_bar(stat ='identity',
#                     position = "dodge")+
#   ggplot2::labs(x = "Dry season malaria cases",
#                 y = " ")+
#   ggplot2::theme_classic()+
#   ggsci::scale_fill_jama()

2.1.1 Monthly variation in malaria cases

ggplot2::ggplot(monthly_malaria_2017_2021) +
  aes(x = date, 
      y = NMCP) +
  geom_line(size = 0.6, 
            colour = "#112446") +
  scale_y_continuous(trans = "log2") +
  labs(x = "Year", 
       y = "Confirmed malaria cases") +
  theme_classic()
Fig.2: Seasonal variation in confirmed malaria cases

Fig.2: Seasonal variation in confirmed malaria cases

3 Kasungu health-care facilities and their catchment areas

Heath facility catchment area is the area from which a health facility attracts patients. Since the available official catchment areas are oudated and no recent spatial data about the catchment areas is available, new health facility catchments polygon were generated from generic accessibility mapping script adapted from https://malariaatlas.org/wp-content/uploads/accessibility/R_generic_accessibilty_mapping_script.r The script requires two user supplied datasets: the 2015 friction surface, which is available here: http://www.map.ox.ac.uk/accessibility_to_cities/, and a user-supplied .csv of points dry_season_malaria_2017_2020. The accumulated cost algorithm accCost and r.Cost algorithm in QGIS were run to make the final output map of new health facility catchment boundaries.

# Using the complete.cases() function to select health centres with complete 
# longitude and latitude coordinates.
zipatala <- dry_season_malaria_2017_2020[complete.cases(dry_season_malaria_2017_2020),] 

# Aggregate health facilities close to each other: 
#  a) Kasalika Health Centre and Kasungu District Hospital, 
#  b) Bua and Mziza Health Centres, and 
#  c) Kaluluma and Nkhamenya Rural Hospitals 
#  d) Anchor and Santhe Health Centres in order to 
# generate catchment areas that are geographically correct

zipatala$dr_2017[which(
  zipatala$Names == "Kasungu District Hospital")] <- zipatala$dr_2017[which(
    zipatala$Names == "Kasungu District Hospital")] + zipatala$dr_2017[which(
      zipatala$Names == "Kasalika Health Centre")]

zipatala$dr_2018[which(
  zipatala$Names == "Kasungu District Hospital")] <- zipatala$dr_2018[which(
    zipatala$Names == "Kasungu District Hospital")] + zipatala$dr_2018[which(
      zipatala$Names == "Kasalika Health Centre")]

zipatala$dr_2019[which(
  zipatala$Names == "Kasungu District Hospital")] <- zipatala$dr_2019[which(
    zipatala$Names == "Kasungu District Hospital")] + zipatala$dr_2019[which(
      zipatala$Names == "Kasalika Health Centre")]

zipatala$dr_2020[which(
  zipatala$Names == "Kasungu District Hospital")] <- zipatala$dr_2020[which(
    zipatala$Names == "Kasungu District Hospital")] + zipatala$dr_2020[which(
      zipatala$Names == "Kasalika Health Centre")]
  
zipatala$dr_2017[which(
  zipatala$Names == "Nkhamenya Rural Hospital")] <- zipatala$dr_2017[which(
    zipatala$Names == "Nkhamenya Rural Hospital")] + zipatala$dr_2017[which(
      zipatala$Names == "Kaluluma Rural Hospital")]

zipatala$dr_2018[which(
  zipatala$Names == "Nkhamenya Rural Hospital")] <- zipatala$dr_2018[which(
    zipatala$Names == "Nkhamenya Rural Hospital")] +zipatala$dr_2018[which(
      zipatala$Names == "Kaluluma Rural Hospital")]

zipatala$dr_2019[which(
  zipatala$Names == "Nkhamenya Rural Hospital")] <- zipatala$dr_2019[which(
    zipatala$Names == "Nkhamenya Rural Hospital")] + zipatala$dr_2019[which(
      zipatala$Names == "Kaluluma Rural Hospital")]

zipatala$dr_2020[which(
  zipatala$Names == "Nkhamenya Rural Hospital")] <- zipatala$dr_2020[which(
    zipatala$Names == "Nkhamenya Rural Hospital")] + zipatala$dr_2020[which(
      zipatala$Names == "Kaluluma Rural Hospital")]

zipatala$dr_2017[which(
  zipatala$Names == "Mziza Health Centre")] <- zipatala$dr_2017[which(
    zipatala$Names == "Mziza Health Centre")] + zipatala$dr_2017[which(
      zipatala$Names == "Bua Health Centre")]

zipatala$dr_2018[which(
  zipatala$Names == "Mziza Health Centre")] <- zipatala$dr_2018[which(
    zipatala$Names == "Mziza Health Centre")] + zipatala$dr_2018[which(
      zipatala$Names == "Bua Health Centre")]

zipatala$dr_2019[which(
  zipatala$Names == "Mziza Health Centre")] <- zipatala$dr_2019[which(
    zipatala$Names == "Mziza Health Centre")] + zipatala$dr_2019[which(
      zipatala$Names == "Bua Health Centre")] 
  

zipatala$dr_2020[which(
  zipatala$Names == "Mziza Health Centre")] <- zipatala$dr_2020[which(
    zipatala$Names == "Mziza Health Centre")] + zipatala$dr_2020[which(
      zipatala$Names == "Bua Health Centre")]

zipatala$dr_2017[which(
  zipatala$Names == "Santhe Health Centre")] <- zipatala$dr_2017[which(
    zipatala$Names == "Santhe Health Centre")] + zipatala$dr_2017[which(
      zipatala$Names == "Anchor Farm")]

zipatala$dr_2018[which(
  zipatala$Names == "Santhe Health Centre")] <- zipatala$dr_2018[which(
    zipatala$Names == "Santhe Health Centre")] + zipatala$dr_2018[which(
      zipatala$Names == "Anchor Farm")]

zipatala$dr_2019[which(
  zipatala$Names == "Santhe Health Centre")] <- zipatala$dr_2019[which(
    zipatala$Names == "Santhe Health Centre")] + zipatala$dr_2019[which(
      zipatala$Names == "Anchor Farm")]

zipatala$dr_2020[which(
  zipatala$Names == "Santhe Health Centre")] <- zipatala$dr_2020[which(
    zipatala$Names == "Santhe Health Centre")] + zipatala$dr_2020[which(
      zipatala$Names == "Anchor Farm")]


# Drop out the other health facilities
zipatala_aggregated <- zipatala |>
  dplyr::filter(Names != "Kasalika Health Centre",
                Names != "Bua Health Centre",
                Names != "Kaluluma Rural Hospital",
                Names != "Anchor Farm")

# Save csv file
# write.csv(zipatala_aggregated, "data/zipatala_aggregated.csv")

# Convert to csv to sf object
zipatala_aggregated_sf <- sf::st_as_sf(zipatala_aggregated,
                                        coords = c("LONGITU", "LATITUD"),
                                        crs = 4326, agr = "constant")

# st_write(zipatala_aggregated_sf, "data/zipatala_combined.shp")

3.1 View location of the health facilities in the new catchment areas

# Plot map
tm_shape(malire_new)+
  tm_polygons()+
  tm_shape(zipatala_aggregated_sf)+
  tm_dots(size = .3, 
          col = "blue", 
          alpha = 0.5)+
  tm_text("Names", 
          size = .3, 
          just = "top", 
          col = "black", 
          remove.overlap = TRUE)+
  tm_layout(frame = FALSE,
            title = "New Kasungu health facility \n catchment boundaries",
            title.size = .8, 
            title.position = c("left", "top"))+
  tm_compass(position=c("right", "top"))+
  tm_scale_bar(breaks = c(0, 10, 20), 
               text.size = .5)
Fig 3. Kasungu health-care facilities and catchment areas

Fig 3. Kasungu health-care facilities and catchment areas

3.2 Kasungu district estimated population per grid-cell

# Take a glimpse at the WorldPop raster layers
kasungu_population_2017
## class      : RasterLayer 
## dimensions : 150, 128, 19200  (nrow, ncol, ncell)
## resolution : 920.0898, 918.7667  (x, y)
## extent     : 491272.7, 609044.2, 8494349, 8632164  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=36 +south +datum=WGS84 +units=m +no_defs 
## source     : ku_pop_2017_1km_aggregated.tif 
## names      : ku_pop_2017_1km_aggregated
kasungu_population_2018
## class      : RasterLayer 
## dimensions : 150, 128, 19200  (nrow, ncol, ncell)
## resolution : 920.0898, 918.7667  (x, y)
## extent     : 491272.7, 609044.2, 8494349, 8632164  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=36 +south +datum=WGS84 +units=m +no_defs 
## source     : ku_pop_2018_1km_aggregated.tif 
## names      : ku_pop_2018_1km_aggregated 
## values     : 0, 6253.557  (min, max)
kasungu_population_2019
## class      : RasterLayer 
## dimensions : 150, 128, 19200  (nrow, ncol, ncell)
## resolution : 920.0898, 918.7667  (x, y)
## extent     : 491272.7, 609044.2, 8494349, 8632164  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=36 +south +datum=WGS84 +units=m +no_defs 
## source     : ku_pop_2019_1km_aggregated.tif 
## names      : ku_pop_2019_1km_aggregated 
## values     : 0, 6483.727  (min, max)
kasungu_population_2020
## class      : RasterLayer 
## dimensions : 150, 128, 19200  (nrow, ncol, ncell)
## resolution : 920.0898, 918.7667  (x, y)
## extent     : 491272.7, 609044.2, 8494349, 8632164  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=36 +south +datum=WGS84 +units=m +no_defs 
## source     : ku_pop_2020_1km_aggregated.tif 
## names      : ku_pop_2020_1km_aggregated 
## values     : 0, 7949.033  (min, max)
# Helper function to create a raster population map
create.population.map <- function(population.raster, title){
  # raster population map
  # arguments:
  #   population.raster:  aggregated population raster layer from WorldPop
  #   legend.title: legend title
  # returns:
  #   a tmap-element (plots a map)
  
  # set to interactive view
  # tmap::tmap_mode("view")
  
  # plot map
  tmap::tm_shape(population.raster)+
    tmap::tm_raster(palette = "-viridis", 
                    title = title,
                    breaks = c(0,100,200,400,600,800,1000,2000,4000,6000,8000))+
    tmap::tm_layout(legend.position = c("right", "bottom"),
                    frame = FALSE)+
    tmap::tm_scale_bar(position = c("left", "bottom"))
}
# Set to static map
 tmap_mode("plot")

estimated_pop_2017 <- create.population.map(kasungu_population_2017, title = "2017 Population")

estimated_pop_2018 <- create.population.map(kasungu_population_2018, title = "2018 Population")

estimated_pop_2019 <- create.population.map(kasungu_population_2019, title = "2019 Population")

estimated_pop_2020 <- create.population.map(kasungu_population_2020, title = "2020 Population")

# Layout the maps
tmap::tmap_arrange(estimated_pop_2017, estimated_pop_2018, 
                   estimated_pop_2019, estimated_pop_2020, nrow = 2) 
Fig.4 Estimated total number of people per 1km grid-cell

Fig.4 Estimated total number of people per 1km grid-cell

4 Assign dry season malaria cases and population density to new health facility catchments

The WorldPop aggregated population e.g. kasungu_population_2017.tif, and DHIS2 malaria dry_season_malaria_2017_2020 datasets are assigned to the new health facility catchments.

# Helper function that assigns malaria data from health facilities to their catchments areas -------
assign.malaria.data <- function(catchment_boundary, malaria_data){
  # arguments:
  #   catchment_boundary: sf polygon object of new catchment boundaries
  #   malaria_data: sf point object with a data frame containing the dry season malaria cases
  # returns:
  #   catchments_malaria_sf: sf polygon object with a data frame containing dry season malaria cases


  # Convert sf objects to spatial
  catchment_shp <- as(catchment_boundary, "Spatial")
  
  malaria_shp <- as(malaria_data, "Spatial")

  # Match CRS
  malaria_shp <- spTransform(malaria_shp, crs(catchment_shp))

  # Overlay aggregated health facility points and extract 2017 - 2020 malaria cases
  # Using 'point.in.poly' to return a point spatial object, in this case location of health facilities
  # and estimated population instead of sp::over function, which simply returns 
  # a data frame, with the same no. rows.
  # Argument 'sp = TRUE' returns an sp class object, else returns sf class object
  # Joining the malaria and population dataset using only 'merge' function can't work due to 
  # non-unique columns and differences in row numbers
  
  hospitals_in_catchment <- spatialEco::point.in.poly(malaria_shp, catchment_shp, sp = TRUE) 

  # Add the extracted ID, health facility names and dry season malaria cases to 
  # the health facility catchments (hfc)
  hfc_malaria_shp <- merge(catchment_shp, hospitals_in_catchment, by.x = "DN", by.y = "rowID")

  # Convert the shapefile containing malaria data to sf-object
  hfc_malaria_sf <- sf::st_as_sf(hfc_malaria_shp)

  # Tidy the data by dropping columns not needed
  catchment_malaria <- hfc_malaria_sf |> 
    dplyr::select(-c(coords.x1, coords.x2))

  return(out = catchment_malaria)
}


# Invoking the function ------------------------------------------------------------------
malaria_by_catchment <- assign.malaria.data(malire_new, zipatala_aggregated_sf)

5 Assign population data to the health catchment areas

# Helper unction to extract population from WorldPop raster file and assign ---------------------------
# the values to the new catchments.

extract.pop.values <- function(kasungu_pop_raster, catchments){
  # function to extract population from raster file and assign the population to catchments
  # arguments:
  #   kasungu_pop_raster: population raster file clipped to Kasungu district
  #   catchments: shapefile containing the polygons that we wish to use as boundaries
  # returns:
  #   catchments_malaria_pop_sf: sf polygon object containing malaria and population data
  
  # convert from sf to sp
  catchments_sp <- as(catchments, "Spatial")
  
  # Match extent i.e projection
  catchments_sp <- spTransform(catchments_sp, proj4string(kasungu_pop_raster))
  
  # Crop and mask the population raster to exclude Kasungu National Park
  pop_raster_clip <- raster::crop(kasungu_pop_raster, extent(catchments_sp)) |>
    raster::mask(catchments_sp)
  # pop_raster_clip <- raster::mask(raster::crop(kasungu_pop_raster, extent(catchments_sp)), catchments_sp)
  
  # Extracting zonal statistics from a population raster layer. 
  # The population raster is a continuous gridded surface layer that has an 
  # estimated population density value to every square in their grid. 
  # The population values are then summed and apportioned to the catchment polygons
  # catchments_malaria_pop <- catchments %>% 
  #   dplyr::mutate(pop = round(raster::extract(pop_raster_clip, catchments, fun = sum, na.rm = TRUE)))
  
  pop_by_catchment <- round(raster::extract(pop_raster_clip, catchments, fun = sum, na.rm = TRUE))
  
  pop_by_catchment_df <-  pop_by_catchment %>%  
  # apply unlist to the lists to have vectors as the list elements
  lapply(unlist) %>%  
  # convert vectors to data.frames
  lapply(as_tibble) %>%   
  # combine the list of data.frames
  bind_rows(., .id = "rowID") %>%   
  # rename the value variable
  dplyr::rename(pop = value)
  
  # Add row ID to column to catchment layer
  catchments$rowID <- 1:nrow(catchments)
  
  # Merge catchment areas and population data 
  pop_by_catchments <- merge(catchments, pop_by_catchment_df, by = "rowID")
  
  # Cleaning 'Inf' values
  pop_by_catchments |> 
    dplyr::mutate_if(is.numeric, list(~na_if(., Inf))) |> 
    dplyr::mutate_if(is.numeric, list(~na_if(., -Inf)))

  return(out = pop_by_catchments)
  
}

# Invoking the function -------------------------------------------------------------------------
malaria_pop_by_catchment_2017 <- extract.pop.values(kasungu_population_2017, malaria_by_catchment)

malaria_pop_by_catchment_2018 <- extract.pop.values(kasungu_population_2018, malaria_by_catchment)

malaria_pop_by_catchment_2019 <- extract.pop.values(kasungu_population_2019, malaria_by_catchment)

malaria_pop_by_catchment_2020 <- extract.pop.values(kasungu_population_2020, malaria_by_catchment)

5.1 View population by catchment maps

Estimated total number of people within health facility catchment areas.

# max(malaria_pop_by_catchment_2017$pop)
# [1] 143490
# max(malaria_pop_by_catchment_2018$pop)
# [1] 147175
# max(malaria_pop_by_catchment_2019$pop)
# [1] 151079
# max(malaria_pop_by_catchment_2020$pop)
# [1] 185282

# Custom function to create maps of estimated population by catchment areas --------------------------
create.population.map <- function(catchment.area, 
                                  variable = "pop", 
                                  title, 
                                  legend.title = "Estimated \n population"){
  # estimated population map
  # catchment.area: estimated population layer from nachulu function
  # variable: variable name (as character, in qoutes)
  # title: map title in quotes
  # legend.title: legend title in qoutes
  # returns:
  #   a tmap-element (plots a map)
  tm_shape(catchment.area)+
    tm_fill(col = variable, 
            breaks = c(0, 13000, 19000, 27000, 35000, 70000, 140000, 200000),
            palette = "YlOrBr",
            title = legend.title)+
    tm_borders(col = "grey",
               lwd = 0.4)+
    tm_layout(legend.position = c(0.75, "bottom"),
              legend.text.size = 0.6,
              legend.title.size = 0.8,
              frame = FALSE)+
    tm_credits(title, 
               position = c(0.3, 0.8), 
               size = 1)
}

# Invoking the function --------------------------------------------------------------------------
pop_by_catchment_2017 <- create.population.map(malaria_pop_by_catchment_2017, title = "2017")

pop_by_catchment_2018 <- create.population.map(malaria_pop_by_catchment_2018, title = "2018")

pop_by_catchment_2019 <- create.population.map(malaria_pop_by_catchment_2019, title = "2019")

pop_by_catchment_2020 <- create.population.map(malaria_pop_by_catchment_2020, title = "2020")

tmap::tmap_arrange(pop_by_catchment_2017, pop_by_catchment_2018,
                   pop_by_catchment_2019, pop_by_catchment_2020, ncol = 2)
Fig. 5: Estimated population by health facility catchment areas

Fig. 5: Estimated population by health facility catchment areas

5.2 Population density by catchment

# Helper function to calculate population density by catchment -----------------
calculate.population.density <- function(pop.data){
  
  # Convert to spatial object
  pop.sp <- as(pop.data, "Spatial")

  # Calculate area of catchment polygon in square kilometres
  pop.sp$area_sqkm <- round(rgeos::gArea(pop.sp, byid = TRUE) / (1000 * 1000))

  # Calculate population density
  pop.sp$pop_density <- round(pop.sp$pop / pop.sp$area_sqkm)
  
  # Convert back to sf object
  pop.sf <- sf::st_as_sf(pop.sp)
  
  return(pop.sf)
}

# Invoking function ------------------------------------------------------------
pop_density_2017 <- calculate.population.density(malaria_pop_by_catchment_2017)

pop_density_2018 <- calculate.population.density(malaria_pop_by_catchment_2018)

pop_density_2019 <- calculate.population.density(malaria_pop_by_catchment_2019)

pop_density_2020 <- calculate.population.density(malaria_pop_by_catchment_2020)

# Helper function to create population density maps ----------------------------
create.pop.density.map <- function(pop.density.data,
                                   variable = "pop_density", 
                                   title = NA, 
                                   legend.title = "Population \ndensity/km^2"){
  tm_shape(pop.density.data)+
    tm_fill(col = variable, 
            breaks = c(0, 50, 100, 150, 200, 250, 300, 350),
            palette = "-magma",
            title = legend.title)+
    tm_borders()+
    tm_layout(legend.position = c(0.75, "bottom"),
              legend.text.size = 0.6,
              legend.title.size = 0.8,
              frame = FALSE)+
    tm_credits(title, 
               position = c(0.3, 0.8), 
               size = 1)
}

# Invoking function ------------------------------------------------------------
pop_density_2017_map <- create.pop.density.map(pop_density_2017, title = "2017")

pop_density_2018_map <- create.pop.density.map(pop_density_2018, title = "2018")

pop_density_2019_map <- create.pop.density.map(pop_density_2019, title = "2019")

pop_density_2020_map <- create.pop.density.map(pop_density_2020, title = "2020")

# Layout maps
tmap::tmap_arrange(pop_density_2017_map, pop_density_2018_map,
                   pop_density_2019_map, pop_density_2020_map, ncol = 2)
Fig. 6: Estimated population density by health facility catchment areas

Fig. 6: Estimated population density by health facility catchment areas

6 Calculate the expected number of cases for each catchment area

The expected number of dry season malaria cases in catchment i are calculated as the observed risk (r) of malaria i.e. the total number of malaria cases in Kasungu district divided by the total population of the district, multiplied by the number of people in the catchment area: \[E_i = \frac{\sum_i O_i}{\sum_i N_i}\times N_i\]

The expected number of dry season malaria cases are calculated under the assumption that there is no spatial variation in risk, i.e., no difference in infection rates between the catchment areas.

# Compute and print the overall incidence of dry season malaria cases
overall_malaria_incidence_rate_2017 <- round(sum(
  malaria_pop_by_catchment_2017$dr_2017) / sum(
    malaria_pop_by_catchment_2017$pop), 2)

overall_malaria_incidence_rate_2017
## [1] 0.08
overall_malaria_incidence_rate_2018 <- round(sum(
  malaria_pop_by_catchment_2018$dr_2018) / sum(
    malaria_pop_by_catchment_2018$pop), 2)

overall_malaria_incidence_rate_2018
## [1] 0.09
overall_malaria_incidence_rate_2019 <- round(sum(
  malaria_pop_by_catchment_2019$dr_2019) / sum(
    malaria_pop_by_catchment_2019$pop), 2)

overall_malaria_incidence_rate_2019
## [1] 0.09
overall_malaria_incidence_rate_2020 <- round(sum(
  malaria_pop_by_catchment_2020$dr_2020) / sum(
    malaria_pop_by_catchment_2020$pop), 2)

overall_malaria_incidence_rate_2020
## [1] 0.12
# Calculate expected malaria cases ------------------------------------------------
expected_malaria_2017 <- malaria_pop_by_catchment_2017 |>
  dplyr::rename(
    observed_2017 = dr_2017,
     pop_2017 = pop) |> 
  dplyr::mutate(
    expected_2017 = round(sum(observed_2017)/sum(pop_2017, na.rm = TRUE)*pop_2017))

expected_malaria_2018 <- malaria_pop_by_catchment_2018 |>
  dplyr::rename(
    observed_2018 = dr_2018,
    pop_2018 = pop) |> 
  dplyr::mutate(
    expected_2018 = round(sum(observed_2018)/sum(pop_2018, na.rm = TRUE)*pop_2018))

expected_malaria_2019 <- malaria_pop_by_catchment_2019 |>
  dplyr::rename(
    observed_2019 = dr_2019,
    pop_2019 = pop) |>
  dplyr::mutate(
    expected_2019 = round(sum(observed_2019)/sum(pop_2019, na.rm = TRUE)*pop_2019)) 

expected_malaria_2020 <- malaria_pop_by_catchment_2020 |>
  dplyr::rename(
    observed_2020 = dr_2020,
    pop_2020 = pop) |>
  dplyr::mutate(
    expected_2020 = round(sum(observed_2020)/sum(pop_2020, na.rm = TRUE)*pop_2020))

7 Calculate the Standardised Morbidity Ratio of malaria incidences for each catchment area

The SMR compares the risk of morbidity in a population of interest with that of a standard population. In this case, our interest is to find out whether the number of dry season malaria cases in each catchment area are greater than we would expect given the malaria rate for the entire Kasungu district.

We do this by comparing what we observe (O) with what we would expect (E) if the risk of malaria was equal throughout Kasungu. The SMR of catchment i can be calculated as follows: \[SMR_i = \frac{O_i}{E_i}\]

SMRs above 1 represent high rates of dry season malaria and SMRs below 1, viceversa.

# Calculate the ratio of observed to expected (SMR) ----------------------------
SMR_2017 <- expected_malaria_2017 |>
  dplyr::mutate(SMR = round(observed_2017/expected_2017, 1)) |> 
  dplyr::select(rowID,Names, pop_2017, observed_2017, expected_2017, SMR) 

SMR_2018 <- expected_malaria_2018 |> 
  dplyr::mutate(SMR = round(observed_2018/expected_2018, 1)) |> 
  dplyr::select(rowID, Names, pop_2018, observed_2018, expected_2018, SMR) 

SMR_2019 <- expected_malaria_2019 |> 
  dplyr::mutate(SMR = round(observed_2019/expected_2019, 1)) |> 
  dplyr::select(rowID, Names, pop_2019, observed_2019, expected_2019, SMR) 

SMR_2020 <- expected_malaria_2020 |> 
  dplyr::mutate(SMR = round(observed_2020/expected_2020, 1)) |> 
  dplyr::select(rowID, Names, pop_2020, observed_2020, expected_2020, SMR)


# Create SMR tables ------------------------------------------------------------
SMR_table_2017 <- SMR_2017 |>
  dplyr::as_tibble() |>
  dplyr::rename(Health_facility = Names) |> 
  dplyr::select(-rowID, -geometry) |> 
  kable() |>
  kableExtra::kable_styling(full_width = FALSE)

SMR_table_2018 <- SMR_2018 |> 
  dplyr::as_tibble() |> 
  dplyr::rename(Health_facility = Names) |> 
  dplyr::select(-rowID, -geometry) |>
  kable() |> 
  kableExtra::kable_styling(full_width = FALSE)

SMR_table_2019 <- SMR_2019 |>
  dplyr::as_tibble() |>
  dplyr::rename(Health_facility = Names) |> 
  dplyr::select(-rowID, -geometry) |> 
  kable () |> 
  kableExtra::kable_styling(full_width = FALSE)

SMR_table_2020 <- SMR_2020 |>
  dplyr::as_tibble() |> 
  dplyr::rename(Health_facility = Names) |> 
  dplyr::select(-rowID, -geometry) |> 
  kable () |> 
  kableExtra::kable_styling(full_width = FALSE)

SMR_table_2017
Health_facility pop_2017 observed_2017 expected_2017 SMR
Lodjwa Health Centre 9923 564 826 0.7
Nkhamenya Rural Hospital 40154 2720 3344 0.8
Newa Mpasazi Health Centre 13879 216 1156 0.2
Mpepa /Chisinga Health Centre 27459 1523 2287 0.7
Mnyanja Health Centre 39950 1480 3327 0.4
Simlemba Health Centre 26999 1159 2249 0.5
Ofesi Health Centre 28098 1930 2340 0.8
Chulu Health Centre 27906 3482 2324 1.5
Kapelula Health Centre 35727 2970 2976 1.0
Livwezi Health Centre 22009 594 1833 0.3
Gogode Dispensary 13061 1553 1088 1.4
Dwangwa Dispensary 32704 1153 2724 0.4
Chamama Health Facility 20026 1005 1668 0.6
Wimbe Health Centre 11864 2558 988 2.6
Chinyama 12768 1140 1063 1.1
Mdunga Health Centre 18177 1382 1514 0.9
Mtunthama Health Centre 18744 1982 1561 1.3
Kasungu District Hospital 143490 14663 11951 1.2
Chamwabvi Health Centre 35353 2031 2945 0.7
Linyangwa Health Centre 17772 1987 1480 1.3
Mziza Health Centre 44295 4098 3689 1.1
Kawamba Health Centre 26385 3845 2198 1.7
Kamboni Health Centre 21226 2588 1768 1.5
Khola Health Centre 22664 1012 1888 0.5
Santhe Health Centre 43948 5668 3660 1.5
Mkhota Health Centre 23295 1487 1940 0.8
SMR_table_2018
Health_facility pop_2018 observed_2018 expected_2018 SMR
Lodjwa Health Centre 10281 1151 934 1.2
Nkhamenya Rural Hospital 41642 3343 3785 0.9
Newa Mpasazi Health Centre 14248 434 1295 0.3
Mpepa /Chisinga Health Centre 28488 2616 2589 1.0
Mnyanja Health Centre 41856 1715 3804 0.5
Simlemba Health Centre 27455 1506 2496 0.6
Ofesi Health Centre 29002 1773 2636 0.7
Chulu Health Centre 28832 3330 2621 1.3
Kapelula Health Centre 37630 3480 3420 1.0
Livwezi Health Centre 22544 1128 2049 0.6
Gogode Dispensary 13368 2550 1215 2.1
Dwangwa Dispensary 33534 1216 3048 0.4
Chamama Health Facility 20372 1226 1852 0.7
Wimbe Health Centre 11814 3167 1074 2.9
Chinyama 13138 1673 1194 1.4
Mdunga Health Centre 18928 1894 1720 1.1
Mtunthama Health Centre 19074 3358 1734 1.9
Kasungu District Hospital 147175 12019 13377 0.9
Chamwabvi Health Centre 36167 2079 3287 0.6
Linyangwa Health Centre 18032 1500 1639 0.9
Mziza Health Centre 46313 2291 4210 0.5
Kawamba Health Centre 26253 3881 2386 1.6
Kamboni Health Centre 21430 3250 1948 1.7
Khola Health Centre 23195 1697 2108 0.8
Santhe Health Centre 45063 6195 4096 1.5
Mkhota Health Centre 23884 4218 2171 1.9
SMR_table_2019
Health_facility pop_2019 observed_2019 expected_2019 SMR
Lodjwa Health Centre 10608 1168 909 1.3
Nkhamenya Rural Hospital 43293 3932 3709 1.1
Newa Mpasazi Health Centre 14780 626 1266 0.5
Mpepa /Chisinga Health Centre 29456 4169 2523 1.7
Mnyanja Health Centre 43783 2504 3751 0.7
Simlemba Health Centre 28076 1788 2405 0.7
Ofesi Health Centre 30065 2124 2576 0.8
Chulu Health Centre 29731 3537 2547 1.4
Kapelula Health Centre 39747 3357 3405 1.0
Livwezi Health Centre 22945 435 1966 0.2
Gogode Dispensary 13641 1469 1169 1.3
Dwangwa Dispensary 34415 1370 2948 0.5
Chamama Health Facility 20701 1127 1773 0.6
Wimbe Health Centre 11855 2162 1016 2.1
Chinyama 13475 1260 1154 1.1
Mdunga Health Centre 19960 1485 1710 0.9
Mtunthama Health Centre 19385 1718 1661 1.0
Kasungu District Hospital 151079 13052 12942 1.0
Chamwabvi Health Centre 36899 1180 3161 0.4
Linyangwa Health Centre 18279 2692 1566 1.7
Mziza Health Centre 48452 3135 4151 0.8
Kawamba Health Centre 26356 3469 2258 1.5
Kamboni Health Centre 21509 2537 1843 1.4
Khola Health Centre 23816 2139 2040 1.0
Santhe Health Centre 46196 5793 3957 1.5
Mkhota Health Centre 24429 2268 2093 1.1
SMR_table_2020
Health_facility pop_2020 observed_2020 expected_2020 SMR
Lodjwa Health Centre 13081 1788 1538 1.2
Nkhamenya Rural Hospital 53692 8539 6313 1.4
Newa Mpasazi Health Centre 18311 2182 2153 1.0
Mpepa /Chisinga Health Centre 36317 5186 4270 1.2
Mnyanja Health Centre 54649 6117 6426 1.0
Simlemba Health Centre 34240 5310 4026 1.3
Ofesi Health Centre 37240 2323 4379 0.5
Chulu Health Centre 36638 7160 4308 1.7
Kapelula Health Centre 50214 7297 5904 1.2
Livwezi Health Centre 27786 1028 3267 0.3
Gogode Dispensary 16681 2767 1961 1.4
Dwangwa Dispensary 42282 2869 4971 0.6
Chamama Health Facility 25248 635 2969 0.2
Wimbe Health Centre 14367 2233 1689 1.3
Chinyama 16463 1605 1936 0.8
Mdunga Health Centre 25108 3169 2952 1.1
Mtunthama Health Centre 23501 1882 2763 0.7
Kasungu District Hospital 185282 19393 21785 0.9
Chamwabvi Health Centre 45106 1128 5304 0.2
Linyangwa Health Centre 22144 4380 2604 1.7
Mziza Health Centre 60648 5791 7131 0.8
Kawamba Health Centre 32076 7073 3771 1.9
Kamboni Health Centre 25750 4665 3028 1.5
Khola Health Centre 29367 3426 3453 1.0
Santhe Health Centre 56704 6556 6667 1.0
Mkhota Health Centre 29986 4592 3526 1.3

7.1 View observed and expected dry season malaria cases

# Helper function to create maps of observed and expected dry season malaria cases
create.malaria.map <- function(malaria.data, 
                               variable = NA, 
                               title = NA, 
                               legend.title = NA){
  # observed and expected malaria incidence map
  # malaria.data: data frame containing observed and expected malaria cases
  # variable: variable name (as character, in quotes e.g. "observed")
  # title: map title in quotes
  # legend.title: legend title in quotes
  # returns:
  #   a tmap-element (plots a map)
  tm_shape(malaria.data)+
    tm_fill(col = variable, 
            breaks = c(0, 500, 1000, 2500, 5000, 10000, 15000, 20000, 25000),
            palette = "YlOrRd",
            title = legend.title)+
    tm_borders(lw = 0.3)+
    tm_layout(legend.position = c(0.75,"bottom"),
              legend.text.size = 0.5,
              legend.title.size = 0.7,
              frame = FALSE)+
    tm_credits(title, 
               position = c(0.2, 0.8), 
               size = 1)
}

# Invoking the function
# 2017 observed and expected malaria cases -------------------------------------
observed_malaria_2017_map <- create.malaria.map(malaria_pop_by_catchment_2017, 
                                                variable = "dr_2017",
                                                title = "2017",
                                                legend.title = "Observed malaria")

expected_malaria_2017_map <- create.malaria.map(expected_malaria_2017,
                                                variable = "expected_2017",
                                                title = "2017",
                                                legend.title = "Expected malaria")

# 2018 observed and expected malaria cases -------------------------------------
observed_malaria_2018_map <- create.malaria.map(malaria_pop_by_catchment_2018,
                                                variable = "dr_2018",
                                                title = "2018",
                                                legend.title = "Observed malaria")

expected_malaria_2018_map <- create.malaria.map(expected_malaria_2018,
                                                variable = "expected_2018",
                                                title = "2018",
                                                legend.title = "Expected malaria")

# 2019 observed and expected malaria cases -------------------------------------
observed_malaria_2019_map <- create.malaria.map(malaria_pop_by_catchment_2019,
                                                variable = "dr_2019",
                                                title = "2019",
                                                legend.title = "Observed malaria")

expected_malaria_2019_map <- create.malaria.map(expected_malaria_2019,
                                                variable = "expected_2019",
                                                title = "2019",
                                                legend.title = "Expected malaria")

# 2020 observed and expected malaria cases -------------------------------------
observed_malaria_2020_map <- create.malaria.map(malaria_pop_by_catchment_2020,
                                                variable = "dr_2020",
                                                title = "2020",
                                                legend.title = "Observed malaria")

expected_malaria_2020_map <- create.malaria.map(expected_malaria_2020,
                                                variable = "expected_2020",
                                                title = "2020",
                                                legend.title = "Expected malaria")

# Layout maps ------------------------------------------------------------------
tmap::tmap_arrange(observed_malaria_2017_map, expected_malaria_2017_map,
                   observed_malaria_2018_map, expected_malaria_2018_map, 
                   observed_malaria_2019_map, expected_malaria_2019_map,
                   observed_malaria_2020_map, expected_malaria_2020_map, ncol = 2)
Fig. 7: Observed and expected malaria incidence by health facility catchment area, Kasungu

Fig. 7: Observed and expected malaria incidence by health facility catchment area, Kasungu

7.2 View SMR by catchment

A ratio greater than 1.0 indicates that more malaria cases have occurred than would have been expected, while a ratio less than 1.0 indicates that less cases have occurred.

# max(SMR_2017$SMR)
# [1] 2.6
# max(SMR_2018$SMR)
# [1] 2.9
# max(SMR_2019$SMR)
# [1] 2.1
# max(SMR_2020$SMR)
# [1] 1.9

# Define function to create maps of SMR by catchment ---------------------------
create.smr.map <- function(smr.data, 
                           variable = "SMR_category", 
                           title = NA, 
                           legend.title = "SMR"){
  # SMR by catchment map
  # smr.data: sf polygon object containing SMR by catchment data
  # variable: variable name (as character, in qoutes)
  # title: map title in quotes
  # legend.title: legend title in qoutes
  # returns:
  #   a tmap-element (plots a map)
  
  # create category column
  smr.data$SMR_category <- NA
  
  # assigning labels for the SMR estimate legends
  smr.category.list <- c("<0.50", "0.51 to 0.75", "0.76 to 0.99", "1.00", 
                         "1.10 to 1.24", "1.25 to 1.49", ">1.50")
  
  # assigning categories
  smr.data$SMR_category[smr.data$SMR >= 0.00 & smr.data$SMR < 0.49] = -3
  smr.data$SMR_category[smr.data$SMR >= 0.50 & smr.data$SMR < 0.75] = -2
  smr.data$SMR_category[smr.data$SMR >= 0.76 & smr.data$SMR < 0.99] = -1
  smr.data$SMR_category[smr.data$SMR >= 1.00 & smr.data$SMR < 1.09] = 0
  smr.data$SMR_category[smr.data$SMR >= 1.10 & smr.data$SMR < 1.24] = 1
  smr.data$SMR_category[smr.data$SMR >= 1.25 & smr.data$SMR < 1.49] = 2
  smr.data$SMR_category[smr.data$SMR >= 1.50 & smr.data$SMR < 3.00] = 3
  
  # generating divergent colour schemes [Blues - Light Blues – White – Light Reds – Reds]
  # smr.palette <- c("#33a6fe", "#cbe6fe", "#dfeffe", "#feb1b1", "#fe8e8e","#fe0000")
  
  tm_shape(smr.data)+
    tm_fill(col = variable, 
            style = "cat",
            palette = "-RdBu",
            title = legend.title,
            labels = smr.category.list)+
    tm_borders(lw = 0.6)+
    tm_layout(legend.position = c(0.75,"bottom"),
              legend.text.size = 0.5,
              legend.title.size = 0.7,
              frame = FALSE)+
    tm_credits(title, 
               position = c(0.2, 0.8), 
               size = 1)
}

# Invoking function ------------------------------------------------------------
SMR_2017_map <- create.smr.map(SMR_2017, title = "2017")

SMR_2018_map <- create.smr.map(SMR_2018, title = "2018")

SMR_2019_map <- create.smr.map(SMR_2019, title = "2019")

SMR_2020_map <- create.smr.map(SMR_2020, title = "2020")

# Layout maps ------------------------------------------------------------------
tmap::tmap_arrange(SMR_2017_map, SMR_2018_map, SMR_2019_map, SMR_2020_map, ncol = 2)
Fig. 8: Standardised morbidity ratio of malaria by health facility catchment

Fig. 8: Standardised morbidity ratio of malaria by health facility catchment

8 Calculate the proportion of the catchment population living within 1km, 2km, 3km of water bodies

First, using st_buffer, we compute 1km, 2km and 3km buffers around dry season water bodies obtained from LandSat satellite imagery using TropWet tool in Google Earth Engine. Then geometry of the buffer features are then combined resulting in resolved internal boundaries to enable extracting population values from WorldPop raster. Finally, we calculate the proportion of people in each catchment area living within water bodies.

# Combine and transform TropWet derived waterbody polygons -------------------------------

surface_waterbodies_2017 <- sf::st_buffer(dryseason_waterbodies_2017, dist = 30) |>
  sf::st_union() |>
  sf::st_cast("POLYGON") |>
  sf::st_as_sf()

surface_waterbodies_2018 <- sf::st_buffer(dryseason_waterbodies_2018, dist = 30) |>
  sf::st_union() |>
  sf::st_cast("POLYGON") |>
  sf::st_as_sf()

surface_waterbodies_2019 <- sf::st_buffer(dryseason_waterbodies_2019, dist = 30) |>
  sf::st_union() |>
  sf::st_cast("POLYGON") |>
  sf::st_as_sf()

surface_waterbodies_2020 <- sf::st_buffer(dryseason_waterbodies_2020, dist = 30) |>
  sf::st_union() |>
  sf::st_cast("POLYGON") |>
  sf::st_as_sf()


# Helper function to compute 1km, 2km and 3km buffers around the water bodies ---------------------

create.waterbody.buffer <- function(waterbody, distance, catchment){
  # function for creating buffers around waterbodies
  # arguments:
  #   waterbody:  waterbody shapefile
  #   distance: buffer distance in meters
  #   catchment: catchment area shapefile
  # returns:
  #   buffered waterbodies 
  
  # Create buffers around water bodies
  buffer_radius <- sf::st_buffer(waterbody, distance)
  
  # Dissolve the buffers
  # buffer_union <- sf::st_as_sf(st_cast(st_union(buffer_radius),"MULTIPOLYGON"))
  buffer_union <- sf::st_union(buffer_radius) |>
    sf::st_cast("MULTIPOLYGON") |>
    sf::st_as_sf()
  
  # Assign attributes of the 'catchment' to each of the water bodies. 
   buffer_intersect <- sf::st_intersection(buffer_union, catchment)
  
   buffer_intersect_sf <- sf::st_as_sf(buffer_intersect)
   
  # Convert the MULTIPOLYGON object into several POLYGON objects
   # buffer_intersect_polygons <- sf::st_cast(
   #   sf::st_buffer(buffer_intersect_sf,0.0), "MULTIPOLYGON") %>% 
   #   sf::st_cast("POLYGON")
   buffer_intersect_polygons <- sf::st_buffer(buffer_intersect_sf, 0.0) |>
     sf::st_cast("MULTIPOLYGON") |>
     sf::st_cast("POLYGON")
   
  
  # Polygons being seen to be in multiple catchments
   sf::st_intersects(buffer_intersect_polygons, catchment)
  
  # Make the assumption that the attribute is constant throughout the geometry
   sf::st_agr(buffer_intersect_polygons) = "constant"
   
   sf::st_agr(catchment) = "constant"
  
  return(out = buffer_intersect_polygons)
}


# Invoking function
# For 2017 TropWet surface water polygons --------------------------------------------------------
buffer_1km_2017 <- create.waterbody.buffer(waterbody = surface_waterbodies_2017, 
                                           distance = 1000, 
                                           catchment = malire_new)

buffer_2km_2017 <- create.waterbody.buffer(waterbody = surface_waterbodies_2017, 
                                           distance = 2000, 
                                           catchment = malire_new)

buffer_3km_2017 <- create.waterbody.buffer(waterbody = surface_waterbodies_2017, 
                                           distance = 3000,
                                           catchment = malire_new)

# For 2018 TropWet surface water polygons --------------------------------------------------------
buffer_1km_2018 <- create.waterbody.buffer(waterbody = surface_waterbodies_2018, 
                                           distance = 1000, 
                                           catchment = malire_new)

buffer_2km_2018 <- create.waterbody.buffer(waterbody = surface_waterbodies_2018, 
                                           distance = 2000, 
                                           catchment = malire_new)

buffer_3km_2018 <- create.waterbody.buffer(waterbody = surface_waterbodies_2018, 
                                           distance = 3000, 
                                           catchment = malire_new)
 
# For 2019 TropWet surface water polygons ------------------------------------------------------
buffer_1km_2019 <- create.waterbody.buffer(waterbody = surface_waterbodies_2019, 
                                           distance = 1000, 
                                           catchment = malire_new)

buffer_2km_2019 <- create.waterbody.buffer(waterbody = surface_waterbodies_2019, 
                                           distance = 2000, 
                                           catchment = malire_new)

buffer_3km_2019 <- create.waterbody.buffer(waterbody = surface_waterbodies_2019, 
                                           distance = 3000, 
                                           catchment = malire_new)

# For 2020 TropWet surface water polygons ------------------------------------------------------
buffer_1km_2020 <- create.waterbody.buffer(waterbody = surface_waterbodies_2020, 
                                           distance = 1000, 
                                           catchment = malire_new)

buffer_2km_2020 <- create.waterbody.buffer(waterbody = surface_waterbodies_2020, 
                                           distance = 2000, 
                                           catchment = malire_new)

buffer_3km_2020 <- create.waterbody.buffer(waterbody = surface_waterbodies_2020, 
                                           distance = 3000, 
                                           catchment = malire_new)

8.1 View the created waterbody buffers

Note that blank areas in the map represent catchment areas in which no water body was detected — this may be a limitation of using moderate resolution satellite imagery (>30m spatial resolution) to identify surface water.

# Map the buffers
create.buffer.map <- function(buffers, boundary = malire_new, title = NA){
  # function for creating buffer map in ggplot
  # arguments:
  #   buffer:  waterbodies buffer polygon layer
  #   boundary: health facility catchment polygons
  #   title: main title
  # returns:
  #   a map-element (plots a map)
  ggplot(data = buffers)+
     geom_sf()+
     geom_sf(data = boundary, 
             fill = NA)+
     theme_void()+
     labs(title = title)
}

# Invoking the function
# For 2017 -------------------------------------------------------------------------------
buffer_1km_2017_map <- create.buffer.map(buffer_1km_2017, title = "2017: 1km Buffers")

buffer_2km_2017_map <- create.buffer.map(buffer_2km_2017, title = "2017: 2km Buffers")

buffer_3km_2017_map <- create.buffer.map(buffer_3km_2017, title = "2017: 3km Buffers")

# For 2018 --------------------------------------------------------------------------------
buffer_1km_2018_map <- create.buffer.map(buffer_1km_2018, title = "2018: 1km Buffers")

buffer_2km_2018_map <- create.buffer.map(buffer_2km_2018, title = "2018: 2km Buffers")

buffer_3km_2018_map <- create.buffer.map(buffer_3km_2018, title = "2018: 3km Buffers")

# For 2019 ---------------------------------------------------------------------------------
buffer_1km_2019_map <- create.buffer.map(buffer_1km_2019, title = "2019: 1km Buffers")

buffer_2km_2019_map <- create.buffer.map(buffer_2km_2019, title = "2019: 2km Buffers")

buffer_3km_2019_map <- create.buffer.map(buffer_3km_2019, title = "2019: 3km Buffers")

# For 2020 --------------------------------------------------------------------------------
buffer_1km_2020_map <- create.buffer.map(buffer_1km_2020, title = "2020: 1km Buffers")

buffer_2km_2020_map <- create.buffer.map(buffer_2km_2020, title = "2020: 2km Buffers")

buffer_3km_2020_map <- create.buffer.map(buffer_3km_2020, title = "2020: 3km Buffers")
 
grid.arrange(buffer_1km_2017_map, buffer_1km_2018_map, buffer_1km_2019_map, buffer_1km_2020_map,
             buffer_2km_2017_map, buffer_2km_2018_map, buffer_2km_2019_map, buffer_2km_2020_map, 
             buffer_3km_2017_map, buffer_3km_2018_map, buffer_3km_2019_map, buffer_3km_2020_map, ncol = 4)
Fig 9. Buffers around dry season waterbodies in Kasungu

Fig 9. Buffers around dry season waterbodies in Kasungu

9 Extract the population living within waterbody buffers by catchment area

# Helper function to calculate estimated number of people living within waterbody buffers
# in each catchment area
estimate.buffer.pop <- function(catchment.population, buffers, catchment.area){
  
  # Extract population estimates from WorldPop raster
  buffers$buffer_pop <- raster::extract(catchment.population,
                                        buffers, 
                                        fun = sum, 
                                        na.rm = TRUE)
                                               
                                              
  # Find which catchment each polygon belongs to using its centroid - a point dataset 
  # representing the geographic center-points of the polygons 
  # buffer_by_catchment <- st_intersection(st_centroid(buffers), catchment.area)
  buffer_by_catchment <- sf::st_centroid(buffers) |>
    sf::st_intersection(catchment.area)
  
  # Notice that the buffer_catchment is comprised of separate POLYGONS (buffer_by_catchment$x). 
  # The first step is to “dissolve” away these POLYGONS into one MULTIPOLYGON. 
  # There is no sf equivalent to the QGIS or ArcMap “dissolve” operation. 
  # Instead we use a combination of group_by and summarize from the dplyr package. 
  # Stats::aggregate from sf package, and dplyr::summarize both do essentially the same.
   buffer_pop_aggregated <- buffer_by_catchment |> 
     dplyr::group_by(DN) |>
     dplyr::summarize(buffer_pop_aggregated = round(sum(buffer_pop, na.rm = TRUE)))

  buffer_pop <- merge(catchment.area, st_drop_geometry(buffer_pop_aggregated),
                      by = 'DN', all.x = TRUE)
  
  
  
  return(out = buffer_pop)
  
}

# Invoking the function and calculating proportion of 
# catchment population living within buffers
# 2017 buffer population -------------------------------------------------------
buffer_pop_1km_2017 <- estimate.buffer.pop(
  kasungu_population_2017, 
  buffer_1km_2017, 
   malaria_pop_by_catchment_2017) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |>
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100)) 

buffer_pop_2km_2017 <- estimate.buffer.pop(
  kasungu_population_2017,
  buffer_2km_2017,
  malaria_pop_by_catchment_2017) |>
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |>
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100)) 

buffer_pop_3km_2017 <- estimate.buffer.pop(
  kasungu_population_2017,
  buffer_3km_2017,
  malaria_pop_by_catchment_2017) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |>
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

# 2018 buffer population -------------------------------------------------------
buffer_pop_1km_2018 <- estimate.buffer.pop(
  kasungu_population_2018,
  buffer_1km_2018,
  malaria_pop_by_catchment_2018) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |> 
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

buffer_pop_2km_2018 <- estimate.buffer.pop(
  kasungu_population_2018,
  buffer_2km_2018,
  malaria_pop_by_catchment_2018) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |>
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

buffer_pop_3km_2018 <- estimate.buffer.pop(
  kasungu_population_2018,
  buffer_3km_2018,
  malaria_pop_by_catchment_2018) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |> 
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

# 2019 buffer population -------------------------------------------------------
buffer_pop_1km_2019 <- estimate.buffer.pop(
  kasungu_population_2019,
  buffer_1km_2019,
  malaria_pop_by_catchment_2019) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |> 
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

buffer_pop_2km_2019 <- estimate.buffer.pop(
  kasungu_population_2019,
  buffer_2km_2019,
  malaria_pop_by_catchment_2019) |>
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |> 
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

buffer_pop_3km_2019 <- estimate.buffer.pop(
  kasungu_population_2019,
  buffer_3km_2019,
  malaria_pop_by_catchment_2019) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |>
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

# 2020 buffer population -------------------------------------------------------
buffer_pop_1km_2020 <- estimate.buffer.pop(
  kasungu_population_2020,
  buffer_1km_2020,
  malaria_pop_by_catchment_2020) |>
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |> 
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

buffer_pop_2km_2020 <- estimate.buffer.pop(
  kasungu_population_2020,
  buffer_2km_2020,
  malaria_pop_by_catchment_2020) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |> 
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

buffer_pop_3km_2020 <- estimate.buffer.pop(
  kasungu_population_2020,
  buffer_3km_2020,
  malaria_pop_by_catchment_2020) |> 
  dplyr::rename(catchment_pop = pop,
                buffer_pop = buffer_pop_aggregated) |> 
  dplyr::mutate(
    prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))

9.1 Mapping proportion of catchment population living within waterbodies

# Helper function to create maps of proportion of people living in proximity ----------
# to water bodies in each catchment area
create.pop.proportion.map <- function(pop.data, 
                                      variable = "prop_buffer_catchment_pop", 
                                      title = NA, 
                                      legend.title = NA){
 
  # pop.data: sf polygon object containing proportion of catchment population 
  #           living within water bodies
  # variable: variable name (as character, in qoutes)
  # title: map title in quotes
  # legend.title: legend title in qoutes
  # returns:
  #   a tmap-element (plots a map)
 
  tm_shape(pop.data)+
    tm_fill(col = variable, 
            breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
            palette = "YlOrBr",
            title = legend.title)+
    tm_borders(lw = 0.3)+
    tm_layout(legend.position = c(0.8,"bottom"),
              legend.text.size = 0.5,
              legend.title.size = 0.7,
              frame = FALSE)+
    tm_credits(title, 
               position = c(0.25, 0.75), 
               size = 1)
}

# Invoking function 
# 2017 population proportion ---------------------------------------------------
pop_proportion_1km_2017_map <- create.pop.proportion.map(
  buffer_pop_1km_2017, 
  title = "2017",
  legend.title = "Population within \n1km buffers (%)")

pop_proportion_2km_2017_map <- create.pop.proportion.map(
  buffer_pop_2km_2017, 
  title = "2017",
  legend.title = "Population within \n2km buffers (%)")

pop_proportion_3km_2017_map <- create.pop.proportion.map(
  buffer_pop_3km_2017,
  title = "2017",
  legend.title = "Population within \n3km buffers (%)")

# 2018 population proportion ---------------------------------------------------
pop_proportion_1km_2018_map <- create.pop.proportion.map(
  buffer_pop_1km_2018,
  title = "2018",
  legend.title = "Population within \n1km buffers (%)")

pop_proportion_2km_2018_map <- create.pop.proportion.map(
  buffer_pop_2km_2018,
  title = "2018",
  legend.title = "Population within \n2km buffers (%)")

pop_proportion_3km_2018_map <- create.pop.proportion.map(
  buffer_pop_3km_2018,
  title = "2018",
  legend.title = "Population within \n3km buffers (%)")

# 2019 population proportion ---------------------------------------------------
pop_proportion_1km_2019_map <- create.pop.proportion.map(
  buffer_pop_1km_2019,
  title = "2019",
  legend.title = "Population within \n1km buffers (%)")

pop_proportion_2km_2019_map <- create.pop.proportion.map(
  buffer_pop_2km_2019,
  title = "2019",
  legend.title = "Population within \n2km buffers (%)")

pop_proportion_3km_2019_map <- create.pop.proportion.map(
  buffer_pop_3km_2019,
  title = "2019",
  legend.title = "Population within \n3km buffers (%)")

# 2020 population proportion ---------------------------------------------------
pop_proportion_1km_2020_map <- create.pop.proportion.map(
  buffer_pop_1km_2020,
  title = "2020",
  legend.title = "Population within \n1km buffers (%)")

pop_proportion_2km_2020_map <- create.pop.proportion.map(
  buffer_pop_2km_2020,
  title = "2020",
  legend.title = "Population within \n2km buffers (%)")

pop_proportion_3km_2020_map <- create.pop.proportion.map(
  buffer_pop_3km_2020,
  title = "2020",
  legend.title = "Population within \n3km buffers (%)")

# Layout maps ------------------------------------------------------------------
tmap::tmap_arrange(pop_proportion_1km_2017_map, pop_proportion_2km_2017_map, 
                   pop_proportion_3km_2017_map, pop_proportion_1km_2018_map,
                   pop_proportion_2km_2018_map, pop_proportion_3km_2018_map,
                   pop_proportion_1km_2019_map, pop_proportion_2km_2019_map,
                   pop_proportion_3km_2019_map, pop_proportion_1km_2020_map,
                   pop_proportion_2km_2020_map, pop_proportion_3km_2020_map, ncol = 3)
Fig. 10. Proportion of catchment population living around water bodies

Fig. 10. Proportion of catchment population living around water bodies

10 Scatter plots of SMR against the proportion of the catchment population living waterbody buffers

A correlation coeeficient of more than zero (cor.coeff r > 0.1) indicates some positive association between the SMR and the buffer population variables. That is, SMR of dry season malaria increases with increase in number of people surrounding water bodies.

# Helper function to tidy and bind the SMR and proportion of -------------------
# buffer-catchment population data frames
tidy.data <- function(smr.df, 
                      proportion.pop.1km, 
                      proprotion.pop.2km,
                      proportion.pop.3km){

# Convert the sf objects to data frames-------------------------------------------
smr_df <- as.data.frame(smr.df) |> 
  dplyr::select(rowID, Names, SMR)

proportion_pop_1km_df <- as.data.frame(proportion.pop.1km) |> 
  dplyr::select(rowID, prop_pop_1km = `prop_buffer_catchment_pop`)

proportion_pop_2km_df <- as.data.frame(proprotion.pop.2km) |> 
  dplyr::select(rowID, prop_pop_2km = `prop_buffer_catchment_pop`)

proportion_pop_3km_df <- as.data.frame(proportion.pop.3km) |>
  dplyr::select(rowID, prop_pop_3km = `prop_buffer_catchment_pop`)

# Merge SMR and population data frames -----------------------------------------
combined_1 <- merge(smr_df, proportion_pop_1km_df, by = "rowID", all = TRUE)

combined_2 <- merge(proportion_pop_2km_df, proportion_pop_3km_df)

combined_fully <- merge(combined_1, combined_2, by = "rowID", all = TRUE)

}

# Invoking the function --------------------------------------------------------
smr_pop_2017 <- tidy.data(SMR_2017, buffer_pop_1km_2017, buffer_pop_2km_2017, buffer_pop_3km_2017)

smr_pop_2018 <- tidy.data(SMR_2018, buffer_pop_1km_2018, buffer_pop_2km_2018, buffer_pop_3km_2018)

smr_pop_2019 <- tidy.data(SMR_2019, buffer_pop_1km_2019, buffer_pop_2km_2019, buffer_pop_3km_2019)

smr_pop_2020 <- tidy.data(SMR_2020, buffer_pop_1km_2020, buffer_pop_2km_2020, buffer_pop_3km_2020)

# Helper function to create scatter plots --------------------------------------
create.scatter.plot <- function(smr.pop.df, 
                                independent.var = NA,
                                dependent.var = "SMR",
                                x.label = NA,
                                plot.title = NA){
  
  scatter.plot <- ggpubr::ggscatter(smr.pop.df,          # data frame
                                    x = independent.var, # x-axis variable
                                    y = dependent.var,   # y-axis variable
                                    add = "reg.line",    # Add regression line
                                    conf.int = TRUE,     # Add confidence interval
                                    add.params = list(color = "red",
                                                      fill = "lightgray"),
                                    palette = "jco",     # journal color palette. see ?ggpar
                                    xlab = x.label,      # x-axis label
                                    ylab = "SMR",        # y-axis label
                                    title = plot.title)+    
                  ggpubr::stat_cor(label.y = 3)+         # Add correlation coefficient
                  ggpubr::font("title", size = 10, face = "bold")+
                  ggpubr::font("xlab", size = 10)+
                  ggpubr::font("ylab", size = 10)
 
  return(scatter.plot)
  
}

# Invoking function 
# 2017 scatter plots ------------------------------------------------------------

scatter_1km_2017 <- create.scatter.plot(
  smr_pop_2017, independent.var = "prop_pop_1km",
  x.label = "Percentage of catchment population \nliving in 1km buffer",
  plot.title = "2017")

scatter_2km_2017 <- create.scatter.plot(
  smr_pop_2017, independent.var = "prop_pop_2km",
  x.label = "Percentage of catchment population \nliving in 2km buffer",
  plot.title = "2017")

scatter_3km_2017 <- create.scatter.plot(
  smr_pop_2017, independent.var = "prop_pop_3km",
  x.label = "Percentage of catchment population \nliving in 3km buffer",
  plot.title = "2017")

# 2018 scatter plots -----------------------------------------------------------
scatter_1km_2018 <- create.scatter.plot(
  smr_pop_2018, independent.var = "prop_pop_1km",
  x.label = "Percentage of catchment population \nliving in 1km buffer",
  plot.title = "2018")

scatter_2km_2018 <- create.scatter.plot(
  smr_pop_2018, independent.var = "prop_pop_2km",
  x.label = "Percentage of catchment population \nliving in 2km buffer",
  plot.title = "2018")

scatter_3km_2018 <- create.scatter.plot(
  smr_pop_2018, independent.var = "prop_pop_3km",
  x.label = "Percentage of catchment population \nliving in 3km buffer",
  plot.title = "2018")

# 2019 scatter plots -----------------------------------------------------------
scatter_1km_2019 <- create.scatter.plot(
  smr_pop_2019, independent.var = "prop_pop_1km",
  x.label = "Percentage of catchment population \nliving in 1km buffer",
  plot.title = "2019")

scatter_2km_2019 <- create.scatter.plot(
  smr_pop_2019, independent.var = "prop_pop_2km",
  x.label = "Percentage of catchment population \nliving in 2km buffer",
  plot.title = "2019")

scatter_3km_2019 <- create.scatter.plot(
  smr_pop_2019, independent.var = "prop_pop_3km",
  x.label = "Percentage of catchment population \nliving in 3km buffer",
  plot.title = "2019")

# 2020 scatter plots -----------------------------------------------------------
scatter_1km_2020 <- create.scatter.plot(
  smr_pop_2020, independent.var = "prop_pop_1km",
  x.label = "Percentage of catchment population \nliving in 1km buffer",
  plot.title = "2020")

scatter_2km_2020 <- create.scatter.plot(
  smr_pop_2020, independent.var = "prop_pop_2km",
  x.label = "Percentage of catchment population \nliving in 2km buffer",
  plot.title = "2020")

scatter_3km_2020 <- create.scatter.plot(
  smr_pop_2020, independent.var = "prop_pop_3km",
  x.label = "Percentage of catchment population \nliving in 3km buffer",
  plot.title = "2020")

# Arrange the plots ------------------------------------------------------------
ggpubr::ggarrange(scatter_1km_2017, scatter_2km_2017, scatter_3km_2017,
                  scatter_1km_2018, scatter_2km_2018, scatter_3km_2018,
                  scatter_1km_2019, scatter_2km_2019, scatter_3km_2019, 
                  scatter_1km_2020, scatter_2km_2020, scatter_3km_2020,
                  ncol = 3, nrow = 4)
Fig 11. Relationship between standardised morbidity ratio and living near waterbodies

Fig 11. Relationship between standardised morbidity ratio and living near waterbodies

11 Model fitting

\[ ln (E(y)) = 𝜷_0 + 𝜷_1 x + ln(𝒆_𝒊)\] where, dependent variable, 𝑦 = observed malaria cases; 𝐸(𝑦) = expected count value; independent variable, 𝑥 = percentage of people living near dams; and offset, 𝑒 = expected malaria cases.

To cope with the malaria count data coming from populations of different sizes, we specify an offset argument. This adds a constant term for each row of the data in the model. The log of the population is used in the offset term.

Summary outputs:

Estimate : the intercept (\(𝜷_0\)) and the beta coefficient estimates associated to each predictor variable.

Std.Error : the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.

t value : the t-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3). For a given the predictor, the t-statistic evaluates whether or not there is significant association between the predictor and the outcome variable, i.e., whether the beta coefficient of the predictor is significantly different from zero.

Pr(>|t|) : The p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate is.

Residuals : Provide a quick view of the distribution of the residuals, which by definition have a mean zero. Therefore, the median should not be far from zero, and the minimum (min)and maximum (max) should be roughly equal in absolute value.

Coefficients: Shows the regression beta coefficients and their statistical significance. Predictor variables, that are significantly associated to the outcome variable, are marked by stars.

Residual standard error (RSE), and R-squared (\(R^2\)) metrics tell how well the model fits to our data. An (adjusted) \(R^2\) that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model. A number near 0 indicates that the regression model did not explain much of the variability in the outcome.

# Combine data for model fitting -----------------------------------------------
model_data_2017 <- merge(expected_malaria_2017, smr_pop_2017, by = "rowID", all = TRUE) |>
  dplyr::select(-Names.y) |> 
  dplyr::rename(Names = Names.x)

model_data_2018 <-  merge(expected_malaria_2018, smr_pop_2018, by = "rowID", all = TRUE) |> 
  dplyr::select(-Names.y) |> 
  dplyr::rename(Names = Names.x)

model_data_2019 <-  merge(expected_malaria_2019, smr_pop_2019, by = "rowID", all = TRUE) |>
  dplyr::select(-Names.y) |> 
  dplyr::rename(Names = Names.x)

model_data_2020 <-  merge(expected_malaria_2020, smr_pop_2020, by = "rowID", all = TRUE) |> 
  dplyr::select(-Names.y) |>
  dplyr::rename(Names = Names.x)

# Normality test ---------------------------------------------------------------
# Check whether the dependent variable follows a poisson distribution
# Dry season malaria cases appear to be non normally distributed i.e highly skewed 

graphics::hist(model_data_2017$observed_2017, 
               main = "Histogram of 2017 dry season malaria cases ",
               xlab = "Malaria cases",
               las = 1)

graphics::hist(model_data_2018$observed_2018, 
               main = "Histogram of 2018 dry season malaria cases ",
               xlab = "Malaria cases",
               las = 1)

graphics::hist(model_data_2019$observed_2019,
               main = "Histogram of 2019 dry season malaria cases ",
               xlab = "Malaria cases",
               las = 1)

graphics::hist(model_data_2020$observed_2020, 
               main = "Histogram of 2020 dry season malaria cases ",
               xlab = "Malaria cases",
               las = 1)

# Fit generalised linear model -------------------------------------------------
# Defining model parameters:
# response variable: observed_2017, observed_2018, observed_2019, observed_2020 are
#                    recorded dry season malaria cases in that year
# risk factor: prop_pop_1km,  prop_pop_2km,  prop_pop_3km are the percentage of people living
#             within 1km, 2km and 3km buffers of water bodies, respectively.
# offset: expected_* is the number of malaria cases we would expect if the malaria rate
#          was equal in all the catchment areas

# 2017 -------------------------------------------------------------------------
model_1km_2017 <- glm(observed_2017~prop_pop_1km+offset(log(expected_2017)),
                      data = model_data_2017, family = 'poisson')

# Display the statistical summary of the model
summary(model_1km_2017)
## 
## Call:
## glm(formula = observed_2017 ~ prop_pop_1km + offset(log(expected_2017)), 
##     family = "poisson", data = model_data_2017)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -33.958  -13.752   -5.112   17.353   41.517  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3039740  0.0076437  -39.77   <2e-16 ***
## prop_pop_1km  0.0335505  0.0006191   54.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12380.6  on 22  degrees of freedom
## Residual deviance:  9494.1  on 21  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 9712.8
## 
## Number of Fisher Scoring iterations: 4
# The estimated regression equation can be written as follow: 
# observed malaria cases = -0.2263102 + 0.0247164 * percentage of catchment population living near dam 
# Using this formula, for each change in number of people living near dams,
# we can predict the number of dry season malaria case.
#
# For example:
#   for a buffer population equal zero, we can expect -0.22 malaria cases. 
#   for a buffer population equal 1000, we can expect  round(-0.22 + 0.02 * 1000) = 20 malaria cases.

sjPlot::tab_model(model_1km_2017, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2017
Predictors Incidence Rate Ratios CI p
(Intercept) 0.738 0.727 – 0.749 <0.001
prop_pop_1km 1.034 1.033 – 1.035 <0.001
Observations 23
AIC 9712.787
# From the output above, the R2 is 1, meaning that the observed and the predicted 
# outcome values are highly correlated, which is very good. 

model_2km_2017 <- glm(observed_2017~1+prop_pop_2km+offset(log(expected_2017)),
                      data = model_data_2017, family = 'poisson')

summary(model_2km_2017)
## 
## Call:
## glm(formula = observed_2017 ~ 1 + prop_pop_2km + offset(log(expected_2017)), 
##     family = "poisson", data = model_data_2017)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -37.84  -16.11   -3.29   15.01   41.07  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3605299  0.0080842   -44.6   <2e-16 ***
## prop_pop_2km  0.0152227  0.0002666    57.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12472.8  on 23  degrees of freedom
## Residual deviance:  9336.9  on 22  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 9565
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_2km_2017, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2017
Predictors Incidence Rate Ratios CI p
(Intercept) 0.697 0.686 – 0.708 <0.001
prop_pop_2km 1.015 1.015 – 1.016 <0.001
Observations 24
AIC 9565.008
model_3km_2017 <- glm(observed_2017~1+prop_pop_3km+offset(log(expected_2017)),
                      data = model_data_2017, family = 'poisson')

summary(model_3km_2017)
## 
## Call:
## glm(formula = observed_2017 ~ 1 + prop_pop_3km + offset(log(expected_2017)), 
##     family = "poisson", data = model_data_2017)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -41.722  -13.016   -2.585   13.434   43.672  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.437453   0.009708  -45.06   <2e-16 ***
## prop_pop_3km  0.011086   0.000206   53.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12472.8  on 23  degrees of freedom
## Residual deviance:  9561.6  on 22  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 9789.7
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_3km_2017, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2017
Predictors Incidence Rate Ratios CI p
(Intercept) 0.646 0.633 – 0.658 <0.001
prop_pop_3km 1.011 1.011 – 1.012 <0.001
Observations 24
AIC 9789.686
# 2018 -------------------------------------------------------------------------
model_1km_2018 <- glm(observed_2018~1+prop_pop_1km+offset(log(expected_2018)),
                      data = model_data_2018, family = 'poisson')

summary(model_1km_2018)
## 
## Call:
## glm(formula = observed_2018 ~ 1 + prop_pop_1km + offset(log(expected_2018)), 
##     family = "poisson", data = model_data_2018)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -38.977  -17.683    3.737   15.352   44.012  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3098187  0.0070954  -43.66   <2e-16 ***
## prop_pop_1km  0.0246243  0.0004545   54.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16294  on 24  degrees of freedom
## Residual deviance: 13449  on 23  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 13693
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_1km_2018, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2018
Predictors Incidence Rate Ratios CI p
(Intercept) 0.734 0.723 – 0.744 <0.001
prop_pop_1km 1.025 1.024 – 1.026 <0.001
Observations 25
AIC 13693.117
model_2km_2018 <- glm(observed_2018~1+prop_pop_2km+offset(log(expected_2018)),
                      data = model_data_2018, family = 'poisson')

summary(model_2km_2018)
## 
## Call:
## glm(formula = observed_2018 ~ 1 + prop_pop_2km + offset(log(expected_2018)), 
##     family = "poisson", data = model_data_2018)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -39.669  -18.962    4.564   13.117   42.979  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3287476  0.0081734  -40.22   <2e-16 ***
## prop_pop_2km  0.0099172  0.0002113   46.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16294  on 24  degrees of freedom
## Residual deviance: 14086  on 23  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 14330
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_2km_2018, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2018
Predictors Incidence Rate Ratios CI p
(Intercept) 0.720 0.708 – 0.731 <0.001
prop_pop_2km 1.010 1.010 – 1.010 <0.001
Observations 25
AIC 14329.762
model_3km_2018 <- glm(observed_2018~1+prop_pop_3km+offset(log(expected_2018)),
                      data = model_data_2018, family = 'poisson')

summary(model_3km_2018)
## 
## Call:
## glm(formula = observed_2018 ~ 1 + prop_pop_3km + offset(log(expected_2018)), 
##     family = "poisson", data = model_data_2018)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -42.080  -18.542    1.207   14.162   40.782  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.4616226  0.0105724  -43.66   <2e-16 ***
## prop_pop_3km  0.0089899  0.0001871   48.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16294  on 24  degrees of freedom
## Residual deviance: 13916  on 23  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 14160
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_3km_2018, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2018
Predictors Incidence Rate Ratios CI p
(Intercept) 0.630 0.617 – 0.643 <0.001
prop_pop_3km 1.009 1.009 – 1.009 <0.001
Observations 25
AIC 14159.892
# 2019 -------------------------------------------------------------------------
model_1km_2019 <- glm(observed_2019~1+prop_pop_1km+offset(log(expected_2019)),
                      data = model_data_2019, family = 'poisson')

summary(model_1km_2019)
## 
## Call:
## glm(formula = observed_2019 ~ 1 + prop_pop_1km + offset(log(expected_2019)), 
##     family = "poisson", data = model_data_2019)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -41.345  -10.045    2.782   10.830   38.276  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.1858493  0.0082288  -22.59   <2e-16 ***
## prop_pop_1km  0.0101514  0.0003897   26.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10738  on 25  degrees of freedom
## Residual deviance: 10063  on 24  degrees of freedom
## AIC: 10314
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_1km_2019, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2019
Predictors Incidence Rate Ratios CI p
(Intercept) 0.830 0.817 – 0.844 <0.001
prop_pop_1km 1.010 1.009 – 1.011 <0.001
Observations 26
AIC 10313.731
model_2km_2019 <- glm(observed_2019~1+prop_pop_2km+offset(log(expected_2019)),
                      data = model_data_2019, family = 'poisson')

summary(model_2km_2019)
## 
## Call:
## glm(formula = observed_2019 ~ 1 + prop_pop_2km + offset(log(expected_2019)), 
##     family = "poisson", data = model_data_2019)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -41.706  -13.737    1.651   13.668   33.174  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.0788997  0.0095633  -8.250   <2e-16 ***
## prop_pop_2km  0.0017563  0.0001943   9.037   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10738  on 25  degrees of freedom
## Residual deviance: 10656  on 24  degrees of freedom
## AIC: 10907
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_2km_2019, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2019
Predictors Incidence Rate Ratios CI p
(Intercept) 0.924 0.907 – 0.942 <0.001
prop_pop_2km 1.002 1.001 – 1.002 <0.001
Observations 26
AIC 10906.895
model_3km_2019 <- glm(observed_2019~1+prop_pop_3km+offset(log(expected_2019)),
                      data = model_data_2019, family = 'poisson')

summary(model_3km_2019)
## 
## Call:
## glm(formula = observed_2019 ~ 1 + prop_pop_3km + offset(log(expected_2019)), 
##     family = "poisson", data = model_data_2019)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -41.991  -13.901    1.614   13.634   32.809  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.0733555  0.0114347  -6.415 1.41e-10 ***
## prop_pop_3km  0.0010803  0.0001584   6.821 9.03e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10738  on 25  degrees of freedom
## Residual deviance: 10691  on 24  degrees of freedom
## AIC: 10942
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_3km_2019, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2019
Predictors Incidence Rate Ratios CI p
(Intercept) 0.929 0.909 – 0.950 <0.001
prop_pop_3km 1.001 1.001 – 1.001 <0.001
Observations 26
AIC 10942.072
# 2020 -------------------------------------------------------------------------
model_1km_2020 <- glm(observed_2020~1+prop_pop_1km+offset(log(expected_2020)),
                      data = model_data_2020, family = 'poisson')

summary(model_1km_2020)
## 
## Call:
## glm(formula = observed_2020 ~ 1 + prop_pop_1km + offset(log(expected_2020)), 
##     family = "poisson", data = model_data_2020)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -69.407  -16.175    0.429   19.154   48.059  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.0074094  0.0058680  -1.263    0.207
## prop_pop_1km  0.0002576  0.0004602   0.560    0.576
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 19142  on 21  degrees of freedom
## Residual deviance: 19142  on 20  degrees of freedom
##   (4 observations deleted due to missingness)
## AIC: 19365
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_1km_2020, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2020
Predictors Incidence Rate Ratios CI p
(Intercept) 0.993 0.981 – 1.004 0.207
prop_pop_1km 1.000 0.999 – 1.001 0.576
Observations 22
AIC 19365.478
model_2km_2020 <- glm(observed_2020~1+prop_pop_2km+offset(log(expected_2020)),
                      data = model_data_2020, family = 'poisson')

summary(model_2km_2020)
## 
## Call:
## glm(formula = observed_2020 ~ 1 + prop_pop_2km + offset(log(expected_2020)), 
##     family = "poisson", data = model_data_2020)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -68.557  -20.832    3.906   21.701   46.233  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.1058128  0.0064564  -16.39   <2e-16 ***
## prop_pop_2km  0.0030881  0.0002112   14.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 20251  on 22  degrees of freedom
## Residual deviance: 20037  on 21  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 20270
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_2km_2020, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2020
Predictors Incidence Rate Ratios CI p
(Intercept) 0.900 0.888 – 0.911 <0.001
prop_pop_2km 1.003 1.003 – 1.004 <0.001
Observations 23
AIC 20270.221
model_3km_2020 <- glm(observed_2020~1+prop_pop_3km+offset(log(expected_2020)),
                      data = model_data_2020, family = 'poisson')

summary(model_3km_2020)
## 
## Call:
## glm(formula = observed_2020 ~ 1 + prop_pop_3km + offset(log(expected_2020)), 
##     family = "poisson", data = model_data_2020)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -69.691  -20.630    4.824   21.584   45.976  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.1348562  0.0073906  -18.25   <2e-16 ***
## prop_pop_3km  0.0025381  0.0001518   16.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 20251  on 22  degrees of freedom
## Residual deviance: 19968  on 21  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 20201
## 
## Number of Fisher Scoring iterations: 4
sjPlot::tab_model(model_3km_2020, digits = 3, digits.re = 3,
                  show.r2 = FALSE, show.aic = TRUE)
  observed_2020
Predictors Incidence Rate Ratios CI p
(Intercept) 0.874 0.861 – 0.887 <0.001
prop_pop_3km 1.003 1.002 – 1.003 <0.001
Observations 23
AIC 20201.432

12 Assess the performance and accuracy of the regression model.

Here, we evaluate how well the Poisson regression model is predicting the outcome of a new test data that have not been used to build the model i.e how close the prediction is close to the real value.

Two important metrics have been used to assess the performance of the predictive regression model:

Root Mean Squared Error, which measures the model prediction error. It corresponds to the average difference between the observed known values of the outcome and the predicted value by the model.RMSE is computed as RMSE = mean((observed - predicted)^2) |> sqrt(). The lower the RMSE, the better the model.

\[RMSE = \sqrt \frac {\Sigma ( y - \hat{y} )^2}{N}\]

Pseudo R-squared goodness-of-fit measure for count data in Poisson regression which is nonlinear computed using deviance statistics:

\[D=2∑ni=1{Yilog(Yi/μi)−(Yi−μi)}\] where \[μi=exp(β^0+β^1X1+...+β^pXp)\] denotes the predicted mean for observation i based on the estimated model parameters. AJKOER (https://stats.stackexchange.com/users/54013/ajkoer), Basic R-Squared in Poisson Regression, URL (version: 2020-06-28): https://stats.stackexchange.com/q/474500

# Tidy model assessment data ---------------------------------------------------
model_assessment_data_2017 <- model_data_2017 |>
  dplyr::as_tibble() |>
  dplyr::select(Names, observed_2017, expected_2017,
                prop_pop_1km, prop_pop_2km, prop_pop_3km) |>
  dplyr::glimpse()
## Rows: 26
## Columns: 6
## $ Names         <chr> "Lodjwa Health Centre", "Nkhamenya Rural Hospital", "New~
## $ observed_2017 <int> 564, 2720, 216, 1523, 1480, 1159, 1930, 3482, 2970, 594,~
## $ expected_2017 <dbl> 826, 3344, 1156, 2287, 3327, 2249, 2340, 2324, 2976, 183~
## $ prop_pop_1km  <dbl> 10, 4, 1, NA, 2, 2, NA, 7, 2, 8, 6, 9, 2, 21, 11, NA, 21~
## $ prop_pop_2km  <dbl> 7, 16, 7, NA, 6, 8, 2, 23, 8, 20, 19, 29, 9, 53, 30, NA,~
## $ prop_pop_3km  <dbl> 19, 30, 14, NA, 19, 17, 2, 45, 16, 40, 39, 54, 16, 80, 4~
model_assessment_data_2018 <- model_data_2018 |>
  dplyr::as_tibble() |>
  dplyr::select(Names, observed_2018, expected_2018,
                prop_pop_1km, prop_pop_2km, prop_pop_3km) |>
  dplyr::glimpse()
## Rows: 26
## Columns: 6
## $ Names         <chr> "Lodjwa Health Centre", "Nkhamenya Rural Hospital", "New~
## $ observed_2018 <int> 1151, 3343, 434, 2616, 1715, 1506, 1773, 3330, 3480, 112~
## $ expected_2018 <dbl> 934, 3785, 1295, 2589, 3804, 2496, 2636, 2621, 3420, 204~
## $ prop_pop_1km  <dbl> NA, 3, 1, 4, 3, 3, 2, 11, 9, 8, 9, 10, 3, 20, 14, 7, 25,~
## $ prop_pop_2km  <dbl> NA, 6, 7, 16, 11, 10, 5, 39, 27, 25, 22, 37, 12, 54, 33,~
## $ prop_pop_3km  <dbl> NA, 13, 16, 31, 30, 23, 10, 65, 51, 54, 44, 61, 22, 80, ~
model_assessment_data_2019 <- model_data_2019 |>
  dplyr::as_tibble() |>
  dplyr::select(Names, observed_2019, expected_2019,
                prop_pop_1km, prop_pop_2km, prop_pop_3km) |>
  dplyr::glimpse()
## Rows: 26
## Columns: 6
## $ Names         <chr> "Lodjwa Health Centre", "Nkhamenya Rural Hospital", "New~
## $ observed_2019 <int> 1168, 3932, 626, 4169, 2504, 1788, 2124, 3537, 3357, 435~
## $ expected_2019 <dbl> 909, 3709, 1266, 2523, 3751, 2405, 2576, 2547, 3405, 196~
## $ prop_pop_1km  <dbl> 10, 10, 5, 3, 7, 11, 5, 27, 10, 17, 10, 22, 2, 21, 14, 8~
## $ prop_pop_2km  <dbl> 3, 29, 18, 11, 17, 32, 20, 60, 35, 43, 22, 64, 13, 54, 3~
## $ prop_pop_3km  <dbl> 10, 58, 44, 19, 36, 55, 36, 86, 54, 72, 42, 85, 18, 81, ~
model_assessment_data_2020 <- model_data_2020 |>
  dplyr::as_tibble() |>
  dplyr::select(Names, observed_2020, expected_2020,
                prop_pop_1km, prop_pop_2km, prop_pop_3km) |>
  dplyr::glimpse()
## Rows: 26
## Columns: 6
## $ Names         <chr> "Lodjwa Health Centre", "Nkhamenya Rural Hospital", "New~
## $ observed_2020 <int> 1788, 8539, 2182, 5186, 6117, 5310, 2323, 7160, 7297, 10~
## $ expected_2020 <dbl> 1538, 6313, 2153, 4270, 6426, 4026, 4379, 4308, 5904, 32~
## $ prop_pop_1km  <dbl> 1, 5, 2, NA, 1, 3, NA, 13, NA, 7, 7, 9, 2, 21, 11, NA, 2~
## $ prop_pop_2km  <dbl> 4, 20, 8, NA, 2, 10, 3, 36, NA, 17, 20, 34, 8, 53, 28, N~
## $ prop_pop_3km  <dbl> 9, 42, 17, NA, 5, 22, 2, 54, NA, 38, 40, 58, 14, 80, 45,~
# Note to self: do LOOCV(Leave One Out Cross-Validation) and k-fold Cross Validation
# Split the data into training and test set ------------------------------------
set.seed(2) # generate random numbers

split_2017 <- caTools::sample.split(model_assessment_data_2017, 
                                    SplitRatio = 0.8) # use 80% of the data for training

train_2017 <- subset(model_assessment_data_2017, split = "TRUE")

test_2017 <- subset(model_assessment_data_2017, split = "FALSE")

# Build model
model_2017 <- glm(observed_2017~1+prop_pop_1km+offset(log(expected_2017)),
                      data = train_2017, family = 'poisson')

summary(model_2017)
## 
## Call:
## glm(formula = observed_2017 ~ 1 + prop_pop_1km + offset(log(expected_2017)), 
##     family = "poisson", data = train_2017)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -33.958  -13.752   -5.112   17.353   41.517  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3039740  0.0076437  -39.77   <2e-16 ***
## prop_pop_1km  0.0335505  0.0006191   54.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12380.6  on 22  degrees of freedom
## Residual deviance:  9494.1  on 21  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 9712.8
## 
## Number of Fisher Scoring iterations: 4
# Prediction
predicted_2017 <- predict(model_2017, test_2017, type = "response")

# Compare predicted vs actual dry season malaria cases
par(mfrow = c(2, 1))

plot(test_2017$observed_2017, type = "l", lty = 1.8, col = "blue")

plot(predicted_2017, type = "l", lty = 1.8, col = "red", add = TRUE)

par(mfrow = c(1, 1)) # Create a 2 x 2 plotting matrix


# Split the data into training and test set 
set.seed(123) # generate a sequence of random numbers
# 2017 -------------------------------------------------------------------------
training_samples_2017 <- model_assessment_data_2017$observed_2017 |>
  caret::createDataPartition(p = 0.8, list = FALSE) # use 80% of the data for training

train_data_2017  <- model_assessment_data_2017[training_samples_2017, ]

test_data_2017 <- model_assessment_data_2017[-training_samples_2017, ]

# 2018 
training_samples_2018 <- model_assessment_data_2018$observed_2018 |>
  caret::createDataPartition(p = 0.8, list = FALSE) # see ?createDataPartition

train_data_2018 <- model_assessment_data_2018[training_samples_2018, ]

test_data_2018 <- model_assessment_data_2018[-training_samples_2018, ]

# 2019 
training_samples_2019 <- model_assessment_data_2019$observed_2019 |>
  caret::createDataPartition(p = 0.8, list = FALSE)

train_data_2019 <- model_assessment_data_2019[training_samples_2019, ]

test_data_2019 <- model_assessment_data_2019[-training_samples_2019, ]

# 2020 
training_samples_2020 <- model_assessment_data_2020$observed_2020 |>
  caret::createDataPartition(p = 0.8, list = FALSE)

train_data_2020 <- model_assessment_data_2020[training_samples_2020, ]

test_data_2020 <- model_assessment_data_2020[-training_samples_2020, ]

# Make predictions using the test data in order to evaluate the performance 
# of our regression model aka goodness-of-fit. The "response" type of prediction
# is on the scale of the response variable. Thus for a default binomial model 
# the default predictions are of log-odds (probabilities on logit scale) and 
# type = "response" gives the predicted probabilities. 
# 2017 -------------------------------------------------------------------------
predictions_1km_2017 <- model_1km_2017 |>
  stats::predict.glm(test_data_2017, type = "response") # see ?stats::predict.glm

predictions_2km_2017 <- model_2km_2017 |>
  stats::predict.glm(test_data_2017, type = "response")

# 2018 
predictions_1km_2018 <- model_1km_2018 |>
  stats::predict.glm(test_data_2018, type = "response")

predictions_2km_2018 <- model_2km_2018 |>
  stats::predict.glm(test_data_2018, type = "response")

# 2019 
predictions_1km_2019 <- model_1km_2019 |>
  stats::predict.glm(test_data_2019, type = "response")

predictions_2km_2019 <- model_2km_2019 |>
  stats::predict.glm(test_data_2019, type = "response")

# 2020 
predictions_1km_2020 <- model_1km_2020 |>
  stats::predict.glm(test_data_2020, type = "response")

predictions_2km_2020 <- model_2km_2020 |>
  stats::predict.glm(test_data_2020, type = "response")

# Model performance ------------------------------------------------------------
# (a) Compute the prediction error, RMSE. The lower the RMSE, the better the model
# 2017
caret::RMSE(predictions_1km_2017, test_data_2017$observed_2017)
## [1] NA
caret::RMSE(predictions_2km_2017, test_data_2017$observed_2017)
## [1] NA
# 2018 
caret::RMSE(predictions_1km_2018, test_data_2018$observed_2018)
## [1] 1817.572
caret::RMSE(predictions_2km_2018, test_data_2018$observed_2018)
## [1] 1698.287
# 2019 
caret::RMSE(predictions_1km_2019, test_data_2019$observed_2019)
## [1] 1376.436
caret::RMSE(predictions_2km_2019, test_data_2019$observed_2019)
## [1] 1348.131
# 2020 
caret::RMSE(predictions_1km_2020, test_data_2020$observed_2020)
## [1] 2382.515
caret::RMSE(predictions_2km_2020, test_data_2020$observed_2020)
## [1] 2360.826
# (b) Compute pseudo R-square 
# We express the goodness of fit of the Poisson regression model by widely 
# used variants of pseudo R squared statistics, most of which are based on 
# the deviance of the model: 
#    - the Aldrich-Nelson pseudo-R2 with the Veall-Zimmermann correction, which
#      is the best approximation of the McKelvey-Zavoina.
# Efron, Aldrich-Nelson, McFadden and Nagelkerke approaches severely underestimate 
# the "true R2". -----------------------------------------------------------------

# DescTools::PseudoR2(model_1km_2017, "all")

round(DescTools::PseudoR2(model_1km_2017, c("AldrichNelson", "AIC", "LogLikNull",
                                            "LogLik", "G2", "VeallZimmermann")), 2)
##   AldrichNelson             AIC              G2 VeallZimmermann 
##            0.99         9712.79         2886.48            0.99
round(DescTools::PseudoR2(model_2km_2017, c("AldrichNelson", "AIC", "LogLikNull",
                                            "LogLik", "G2", "VeallZimmermann")), 2)
##   AldrichNelson             AIC              G2 VeallZimmermann 
##            0.99         9565.01         3135.92            0.99
round(DescTools::PseudoR2(model_1km_2018, c("AldrichNelson", "AIC", "LogLikNull",
                                            "LogLik", "G2", "VeallZimmermann")), 2)
##   AldrichNelson             AIC              G2 VeallZimmermann 
##            0.99        13693.12         2844.21            0.99
round(DescTools::PseudoR2(model_2km_2018, c("AldrichNelson", "AIC", "LogLikNull",
                                            "LogLik", "G2", "VeallZimmermann")), 2)
##   AldrichNelson             AIC              G2 VeallZimmermann 
##            0.99        14329.76         2207.56            0.99
round(DescTools::PseudoR2(model_1km_2019, c("AldrichNelson", "AIC", "LogLikNull",
                                            "LogLik", "G2", "VeallZimmermann")), 2)
##   AldrichNelson             AIC              G2 VeallZimmermann 
##            0.96        10313.73          675.17            0.97
round(DescTools::PseudoR2(model_2km_2019, c("AldrichNelson", "AIC", "LogLikNull",
                                            "LogLik", "G2", "VeallZimmermann")), 2)
##   AldrichNelson             AIC              G2 VeallZimmermann 
##            0.76        10906.89           82.01            0.76
round(DescTools::PseudoR2(model_1km_2020, c("AldrichNelson", "AIC", "LogLikNull",
                                            "LogLik", "G2", "VeallZimmermann")), 2)
##   AldrichNelson             AIC              G2 VeallZimmermann 
##            0.01        19365.48            0.31            0.01
round(DescTools::PseudoR2(model_2km_2020, c("AldrichNelson", "AIC", "LogLikNull",
                                            "LogLik", "G2", "VeallZimmermann")), 2)
##   AldrichNelson             AIC              G2 VeallZimmermann 
##            0.90        20270.22          213.26            0.90

12.0.1 Check for collinearity, normality or heteroscedasticity

# Check for collinearity, normality or heteroscedasticity --------------------

performance::check_model(model_1km_2017, theme = "see::theme_modern")

performance::check_model(model_2km_2017, theme = "see::theme_modern")

performance::check_model(model_1km_2018, theme = "see::theme_modern")

performance::check_model(model_2km_2018, theme = "see::theme_modern")

performance::check_model(model_1km_2019, theme = "see::theme_modern")

performance::check_model(model_2km_2019, theme = "see::theme_modern")

performance::check_model(model_1km_2020, theme = "see::theme_modern")

performance::check_model(model_2km_2020, theme = "see::theme_modern")

# Model performance summaries --------------------------------------------------
# Compute indices of model performance for the regression models and 
# compare the quality of the models.
# Note that all score value do not necessarily sum up to 100%. See ?compare_performance
model_comparison <- performance::compare_performance(model_1km_2017, model_2km_2017,
                                                     model_1km_2018, model_2km_2018, 
                                                     model_1km_2019, model_2km_2019,
                                                     model_1km_2020, model_2km_2020, 
                                                     metrics = "common", rank = TRUE)

model_comparison
## # Comparison of Model Performance Indices
## 
## Name           | Model |       AIC |       BIC | Nagelkerke's R2 |     RMSE | Performance-Score
## -----------------------------------------------------------------------------------------------
## model_2km_2017 |   glm |  9565.008 |  9567.364 |           1.000 |  943.312 |           100.00%
## model_1km_2017 |   glm |  9712.787 |  9715.058 |           1.000 |  994.542 |            97.91%
## model_1km_2019 |   glm | 10313.731 | 10316.247 |           1.000 |  952.120 |            96.26%
## model_2km_2019 |   glm | 10906.895 | 10909.411 |           0.957 |  972.461 |            91.85%
## model_1km_2018 |   glm | 13693.117 | 13695.555 |           1.000 | 1363.562 |            69.20%
## model_2km_2018 |   glm | 14329.762 | 14332.200 |           1.000 | 1394.218 |            65.38%
## model_2km_2020 |   glm | 20270.221 | 20272.492 |           1.000 | 1855.248 |            25.00%
## model_1km_2020 |   glm | 19365.478 | 19367.661 |           0.014 | 1836.011 |             4.75%

12.0.2 Visualization of indices of models’ performance

plot(performance::compare_performance(model_1km_2017, model_2km_2017,
                                      model_1km_2018, model_2km_2018, 
                                      model_1km_2019, model_2km_2019,
                                      model_1km_2020, model_2km_2020, 
                                      metrics = "common", rank = TRUE))

12.0.3 Check how well the fitted values line up with the observations

The fitted values appear to line up particularly well with the observed data, suggesting that prop_pop_* (i.e., proportion of catchment population living near water bodies) can help us understand malaria risk in the catchment areas.

# Helper function to create scatter plots to see how well 
# fitted values line up with observed malaria cases
plot.fitted.values <- function(fitted.values.df, model.df, title){
  
  # Remove missing values from model data since 
  # model fitting deletes missing observations
  model.df.complete <- model.df |> 
    tidyr::drop_na() |>  
    dplyr::rename_at(vars(starts_with("observed_")), ~ str_c("observed"))

  # Plot fitted versus observed values
  scatter.plot <- ggplot2::ggplot()+ 
                  ggplot2::geom_point(aes(fitted.values.df$fitted.values,
                                          model.df.complete$observed))+
                  ggplot2::theme_classic()+
                  ggplot2::labs(x = "Fitted values",
                                y = "Observed values",
                                title = title)
  return(scatter.plot)
}

# Invoking function 
# 2017 -------------------------------------------------------------------------
fitted_1km_2017 <- plot.fitted.values(model_1km_2017, model_data_2017, "2017: 1km model")

fitted_2km_2017 <- plot.fitted.values(model_2km_2017, model_data_2017, "2017: 2km model")

fitted_3km_2017 <- plot.fitted.values(model_3km_2017, model_data_2017, "2017: 3km model")

# 2018 -------------------------------------------------------------------------
fitted_1km_2018 <- plot.fitted.values(model_1km_2018, model_data_2018, "2018: 1km model")

fitted_2km_2018 <- plot.fitted.values(model_2km_2018, model_data_2018, "2018: 2km model")

fitted_3km_2018 <- plot.fitted.values(model_3km_2018, model_data_2018, "2018: 3km model")

# 2019 -------------------------------------------------------------------------
fitted_1km_2019 <- plot.fitted.values(model_1km_2019, model_data_2019, "2019: 1km model")

fitted_2km_2019 <- plot.fitted.values(model_2km_2019, model_data_2019, "2019: 2km model")

fitted_3km_2019 <- plot.fitted.values(model_3km_2019, model_data_2020, "2019: 3km model")

# 2020 -------------------------------------------------------------------------
fitted_1km_2020 <- plot.fitted.values(model_1km_2020, model_data_2020, "2020: 1km model")

fitted_2km_2020 <- plot.fitted.values(model_2km_2020, model_data_2020, "2020: 2km model")

fitted_3km_2020 <- plot.fitted.values(model_3km_2020, model_data_2020, "2020: 3km model")

# Layout scatter plots ---------------------------------------------------------
cowplot::plot_grid(fitted_1km_2017, fitted_1km_2018, fitted_2km_2018, fitted_3km_2018,
                   fitted_1km_2019, fitted_2km_2019, fitted_1km_2020, ncol = 3, nrow = 3)
Fig. 12. How well the percentage of catchment population living around water bodies explain observed malaria incidence

Fig. 12. How well the percentage of catchment population living around water bodies explain observed malaria incidence

12.1 Test for residual spatial autocorrelation using adjacency as criterion

# Prep data
prep.spatial.dependency.data <- function(sf_2017, sf_2018, sf_2019, sf_2020){
  
  df_2017 <- sf_2017 |>
    dplyr::as_tibble() |> 
    dplyr::rename(observed_2018 = dr_2018,
                  observed_2019 = dr_2019,
                  observed_2020 = dr_2020,
                  SMR_2017 = SMR) |> 
    dplyr::select(rowID, Names, SMR_2017, geometry)
  
  df_2018 <- sf_2018 |> 
    dplyr::as_tibble() |> 
    dplyr::rename(SMR_2018 = SMR) |> 
    dplyr::select(rowID, SMR_2018)
  
  df_2019 <- sf_2019 |> 
    dplyr::as_tibble() |> 
    dplyr::rename(SMR_2019 = SMR) |> 
    dplyr::select(rowID, SMR_2019)
  
  df_2020 <- sf_2020 |> 
    dplyr::as_tibble() |> 
    dplyr::rename(SMR_2020 = SMR) |>
    dplyr::select(rowID, SMR_2020)
    
  
  spatial_dependency_data <- merge(
    merge(
      merge(
        df_2017, df_2018, by = "rowID", all = TRUE),
        df_2019, by = "rowID", all = TRUE),
        df_2020, by = "rowID", all = TRUE)
  
  return(spatial_dependency_data)
  
}

# Invoking function ------------------------------------------------------------
spatial_dependency_data <- prep.spatial.dependency.data(model_data_2017,
                                                        model_data_2018, 
                                                        model_data_2019,
                                                        model_data_2020)

# Find adjacent polygons i.e., make neigbhour list,
# Contiguity neighbors - all that share a boundary point
spatial_dependency_shp <- sf::st_as_sf(spatial_dependency_data) |> 
  as("Spatial")

catchment_neighbours <- spdep::poly2nb(spatial_dependency_shp)  # Queen contiguity

summary(catchment_neighbours)
## Neighbour list object:
## Number of regions: 26 
## Number of nonzero links: 98 
## Percentage nonzero weights: 14.49704 
## Average number of links: 3.769231 
## Link number distribution:
## 
## 1 2 3 4 5 6 7 8 
## 2 3 8 5 5 1 1 1 
## 2 least connected regions:
## 1 26 with 1 link
## 1 most connected region:
## 18 with 8 links
# Get coordinates from catchment polygons
# Get center points of each catchment area

coords <- coordinates(spatial_dependency_shp)


# View the connections
{plot(spatial_dependency_shp, asp = 1)+
plot(catchment_neighbours, coords, col = "blue", add = TRUE)}
Fig. 13. Neighbourhood matrix

Fig. 13. Neighbourhood matrix

## integer(0)
# Run a Moran I test on SMR
moran.test(spatial_dependency_data$SMR_2017, 
           nb2listw(catchment_neighbours))
## 
##  Moran I test under randomisation
## 
## data:  spatial_dependency_data$SMR_2017  
## weights: nb2listw(catchment_neighbours)    
## 
## Moran I statistic standard deviate = 1.0451, p-value = 0.148
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.10287363       -0.04000000        0.01868994
moran.test(spatial_dependency_data$SMR_2018, 
           nb2listw(catchment_neighbours))
## 
##  Moran I test under randomisation
## 
## data:  spatial_dependency_data$SMR_2018  
## weights: nb2listw(catchment_neighbours)    
## 
## Moran I statistic standard deviate = 1.4625, p-value = 0.0718
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.16260895       -0.04000000        0.01919258
moran.test(spatial_dependency_data$SMR_2019, 
           nb2listw(catchment_neighbours))
## 
##  Moran I test under randomisation
## 
## data:  spatial_dependency_data$SMR_2019  
## weights: nb2listw(catchment_neighbours)    
## 
## Moran I statistic standard deviate = 0.64086, p-value = 0.2608
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.05113078       -0.04000000        0.02022102
moran.test(spatial_dependency_data$SMR_2020, 
           nb2listw(catchment_neighbours))
## 
##  Moran I test under randomisation
## 
## data:  spatial_dependency_data$SMR_2020  
## weights: nb2listw(catchment_neighbours)    
## 
## Moran I statistic standard deviate = 1.0093, p-value = 0.1564
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.10407174       -0.04000000        0.02037431
# Run a Moran I MC test on SMR
moran.mc(spatial_dependency_data$SMR_2017, 
         nb2listw(catchment_neighbours), 
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  spatial_dependency_data$SMR_2017 
## weights: nb2listw(catchment_neighbours)  
## number of simulations + 1: 1000 
## 
## statistic = 0.10287, observed rank = 843, p-value = 0.157
## alternative hypothesis: greater
moran.mc(spatial_dependency_data$SMR_2018, 
         nb2listw(catchment_neighbours), 
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  spatial_dependency_data$SMR_2018 
## weights: nb2listw(catchment_neighbours)  
## number of simulations + 1: 1000 
## 
## statistic = 0.16261, observed rank = 908, p-value = 0.092
## alternative hypothesis: greater
moran.mc(spatial_dependency_data$SMR_2019, 
         nb2listw(catchment_neighbours), 
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  spatial_dependency_data$SMR_2019 
## weights: nb2listw(catchment_neighbours)  
## number of simulations + 1: 1000 
## 
## statistic = 0.051131, observed rank = 772, p-value = 0.228
## alternative hypothesis: greater
moran.mc(spatial_dependency_data$SMR_2020, 
         nb2listw(catchment_neighbours), 
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  spatial_dependency_data$SMR_2020 
## weights: nb2listw(catchment_neighbours)  
## number of simulations + 1: 1000 
## 
## statistic = 0.10407, observed rank = 839, p-value = 0.161
## alternative hypothesis: greater
# Run a Conditional Autoregressive (CAR) model, which allows us to incorporate 
# the spatial autocorrelation between neighbours within our GLM

# First, generate a weights matrix from a neighbours list with spatial weights
adj_matrix <- spdep::nb2mat(catchment_neighbours, style = "B") # see ?nb2mat

# Match row and column names with those of geographic location index 
rownames(adj_matrix) <- colnames(adj_matrix) <- spatial_dependency_data$rowID
# row.names(adj_matrix) <- NULL # alternatively

# Now we can fit the model. The spatial effect is called using the adjacency function which 
# requires the grouping factor (i.e. the rowID of each catchment area)

CAR_model_1km_2017 <- spaMM::fitme(observed_2017~prop_pop_1km+offset(log(expected_2017)),
                                   adjMatrix = adj_matrix, 
                                   data = model_data_2017, 
                                   family = 'poisson')

# Generate 95% CI
coefs <- as.data.frame(summary(CAR_model_1km_2017)$beta_table)
## formula: observed_2017 ~ prop_pop_1km + offset(log(expected_2017))
## Estimation of fixed effects by ML.
## family: poisson( link = log ) 
##  ------------ Fixed effects (beta) ------------
##              Estimate  Cond. SE t-value
## (Intercept)  -0.30397 0.0076437  -39.77
## prop_pop_1km  0.03355 0.0006191   54.19
##  ------------- Likelihood values  -------------
##                         logLik
## p(h)   (Likelihood): -4854.394
# Moran's I contiguity test
MI_2017 <- spdep::moran(model_data_2017$observed_2017, 
                        nb2listw(catchment_neighbours),
                        length(model_data_2017$observed_2017),
                        Szero(nb2listw(catchment_neighbours)))

12.2 Run a Kriging model

Let’s use autoKrige() function in the automap package to compute binned variograms, fit Kriging model, do model selection, and performs kriging by making multiple calls to the gstat functions.

# Get point location of health facilities and convert to spatial object
zipatala <-zipatala_aggregated_sf |> 
  sf::st_transform(32736) # reproject

zipatala_sp <- as(zipatala, "Spatial")

# Make a vector that is TRUE for the missing dry season malaria data
# 2017
missing_2017 <- is.na(zipatala_sp$dr_2017)
table(missing_2017) # no missing data
## missing_2017
## FALSE 
##    26
# Plot a map of dry season malaria cases
# raster::spplot(zipatala_sp, zcol = "dr_2017")


# Kriging with linear trend, predicting over the missing points
malaria_autoKrige_2017 <- automap::autoKrige(
  dr_2017 ~ coords.x1 + coords.x2, 
  input_data = zipatala_sp[!missing_2017, ], 
  new_data = zipatala_sp, 
  model = "Mat") # Matern
## [using universal kriging]
# Plot the variogram, predictions, and standard error
plot(malaria_autoKrige_2017)

13 Yearly variation in observed malaria cases as a risk factor

# Prep model data --------------------------------------------------------------
df2017 <- model_data_2017 |> 
  dplyr::as_tibble() |> 
    dplyr::rename(observed_cases = observed_2017,
                  expected_cases = expected_2017,
                  health_facility = Names) |> 
    dplyr::select(-geometry, -fid,-DN, -X, -SMR, 
                  -pop_2017, -dr_2018, -dr_2019, -dr_2020)

df2017$year <- "2017" # add new column

# df2017 <- cbind(df2017, year = "2017") # alternatively

df2018 <- model_data_2018 |> 
  dplyr::as_tibble() |> 
    dplyr::rename(observed_cases = observed_2018,
                  expected_cases = expected_2018) |> 
    dplyr::select(-geometry, -fid,-DN, -X, -SMR,
                  -pop_2018,-dr_2017, -dr_2019, -dr_2020)

df2018$year <- "2018"

colnames(df2018) <- colnames(df2017) # match columns names

df2019 <- model_data_2019 |> 
  dplyr::as_tibble() |> 
    dplyr::rename(observed_cases = observed_2019,
                  expected_cases = expected_2019) |> 
    dplyr::select(-geometry, -fid,-DN, -X, -SMR,
                  -pop_2019, -dr_2017, -dr_2018, -dr_2020)

df2019$year <- "2019"

colnames(df2019) <- colnames(df2017) # match columns names

df2020 <- model_data_2020 |> 
  dplyr::as_tibble() |> 
    dplyr::rename(observed_cases = observed_2020,
                  expected_cases = expected_2020) |> 
    dplyr::select(-geometry, -fid,-DN, -X, -SMR,
                  -pop_2020, -dr_2017, -dr_2018, -dr_2019) 

df2020$year <- "2020"

colnames(df2020) <- colnames(df2017) # match columns names

model_data <- rbind(df2017, df2018, df2019, df2020)
  


model_data |>     # View model data in table format
  gt::gt() |> 
  gt::tab_style(style = list(cell_text(align = "center")),
                locations = cells_column_labels() ) |> 
  gt::cols_label(health_facility = "Health facility",
                 observed_cases = "Observed cases",
                 expected_cases = "Expected cases",
                 prop_pop_1km = "Proportion of population in 1km buffers",
                 prop_pop_2km = "Proportion of population in 2km buffers",
                 prop_pop_3km   = "Proportion of population in 3km buffers",
                 year = "Year")
rowID Health facility Observed cases Expected cases Proportion of population in 1km buffers Proportion of population in 2km buffers Proportion of population in 3km buffers Year
1 Lodjwa Health Centre 564 826 10 7 19 2017
2 Nkhamenya Rural Hospital 2720 3344 4 16 30 2017
3 Newa Mpasazi Health Centre 216 1156 1 7 14 2017
4 Mpepa /Chisinga Health Centre 1523 2287 NA NA NA 2017
5 Mnyanja Health Centre 1480 3327 2 6 19 2017
6 Simlemba Health Centre 1159 2249 2 8 17 2017
7 Ofesi Health Centre 1930 2340 NA 2 2 2017
8 Chulu Health Centre 3482 2324 7 23 45 2017
9 Kapelula Health Centre 2970 2976 2 8 16 2017
10 Livwezi Health Centre 594 1833 8 20 40 2017
11 Gogode Dispensary 1553 1088 6 19 39 2017
12 Dwangwa Dispensary 1153 2724 9 29 54 2017
13 Chamama Health Facility 1005 1668 2 9 16 2017
14 Wimbe Health Centre 2558 988 21 53 80 2017
15 Chinyama 1140 1063 11 30 46 2017
16 Mdunga Health Centre 1382 1514 NA NA NA 2017
17 Mtunthama Health Centre 1982 1561 21 64 82 2017
18 Kasungu District Hospital 14663 11951 18 39 58 2017
19 Chamwabvi Health Centre 2031 2945 7 24 46 2017
20 Linyangwa Health Centre 1987 1480 5 20 41 2017
21 Mziza Health Centre 4098 3689 13 30 51 2017
22 Kawamba Health Centre 3845 2198 13 34 60 2017
23 Kamboni Health Centre 2588 1768 6 20 33 2017
24 Khola Health Centre 1012 1888 2 8 16 2017
25 Santhe Health Centre 5668 3660 4 13 21 2017
26 Mkhota Health Centre 1487 1940 5 17 24 2017
1 Lodjwa Health Centre 1151 934 NA NA NA 2018
2 Nkhamenya Rural Hospital 3343 3785 3 6 13 2018
3 Newa Mpasazi Health Centre 434 1295 1 7 16 2018
4 Mpepa /Chisinga Health Centre 2616 2589 4 16 31 2018
5 Mnyanja Health Centre 1715 3804 3 11 30 2018
6 Simlemba Health Centre 1506 2496 3 10 23 2018
7 Ofesi Health Centre 1773 2636 2 5 10 2018
8 Chulu Health Centre 3330 2621 11 39 65 2018
9 Kapelula Health Centre 3480 3420 9 27 51 2018
10 Livwezi Health Centre 1128 2049 8 25 54 2018
11 Gogode Dispensary 2550 1215 9 22 44 2018
12 Dwangwa Dispensary 1216 3048 10 37 61 2018
13 Chamama Health Facility 1226 1852 3 12 22 2018
14 Wimbe Health Centre 3167 1074 20 54 80 2018
15 Chinyama 1673 1194 14 33 48 2018
16 Mdunga Health Centre 1894 1720 7 17 45 2018
17 Mtunthama Health Centre 3358 1734 25 69 84 2018
18 Kasungu District Hospital 12019 13377 19 50 66 2018
19 Chamwabvi Health Centre 2079 3287 16 46 70 2018
20 Linyangwa Health Centre 1500 1639 5 18 39 2018
21 Mziza Health Centre 2291 4210 17 40 58 2018
22 Kawamba Health Centre 3881 2386 33 62 84 2018
23 Kamboni Health Centre 3250 1948 19 36 53 2018
24 Khola Health Centre 1697 2108 5 14 28 2018
25 Santhe Health Centre 6195 4096 6 21 36 2018
26 Mkhota Health Centre 4218 2171 17 40 66 2018
1 Lodjwa Health Centre 1168 909 10 3 10 2019
2 Nkhamenya Rural Hospital 3932 3709 10 29 58 2019
3 Newa Mpasazi Health Centre 626 1266 5 18 44 2019
4 Mpepa /Chisinga Health Centre 4169 2523 3 11 19 2019
5 Mnyanja Health Centre 2504 3751 7 17 36 2019
6 Simlemba Health Centre 1788 2405 11 32 55 2019
7 Ofesi Health Centre 2124 2576 5 20 36 2019
8 Chulu Health Centre 3537 2547 27 60 86 2019
9 Kapelula Health Centre 3357 3405 10 35 54 2019
10 Livwezi Health Centre 435 1966 17 43 72 2019
11 Gogode Dispensary 1469 1169 10 22 42 2019
12 Dwangwa Dispensary 1370 2948 22 64 85 2019
13 Chamama Health Facility 1127 1773 2 13 18 2019
14 Wimbe Health Centre 2162 1016 21 54 81 2019
15 Chinyama 1260 1154 14 35 48 2019
16 Mdunga Health Centre 1485 1710 8 33 50 2019
17 Mtunthama Health Centre 1718 1661 30 76 91 2019
18 Kasungu District Hospital 13052 12942 25 58 90 2019
19 Chamwabvi Health Centre 1180 3161 13 54 77 2019
20 Linyangwa Health Centre 2692 1566 15 45 77 2019
21 Mziza Health Centre 3135 4151 28 66 89 2019
22 Kawamba Health Centre 3469 2258 42 78 94 2019
23 Kamboni Health Centre 2537 1843 28 52 78 2019
24 Khola Health Centre 2139 2040 10 19 34 2019
25 Santhe Health Centre 5793 3957 24 58 81 2019
26 Mkhota Health Centre 2268 2093 23 53 86 2019
1 Lodjwa Health Centre 1788 1538 1 4 9 2020
2 Nkhamenya Rural Hospital 8539 6313 5 20 42 2020
3 Newa Mpasazi Health Centre 2182 2153 2 8 17 2020
4 Mpepa /Chisinga Health Centre 5186 4270 NA NA NA 2020
5 Mnyanja Health Centre 6117 6426 1 2 5 2020
6 Simlemba Health Centre 5310 4026 3 10 22 2020
7 Ofesi Health Centre 2323 4379 NA 3 2 2020
8 Chulu Health Centre 7160 4308 13 36 54 2020
9 Kapelula Health Centre 7297 5904 NA NA NA 2020
10 Livwezi Health Centre 1028 3267 7 17 38 2020
11 Gogode Dispensary 2767 1961 7 20 40 2020
12 Dwangwa Dispensary 2869 4971 9 34 58 2020
13 Chamama Health Facility 635 2969 2 8 14 2020
14 Wimbe Health Centre 2233 1689 21 53 80 2020
15 Chinyama 1605 1936 11 28 45 2020
16 Mdunga Health Centre 3169 2952 NA NA NA 2020
17 Mtunthama Health Centre 1882 2763 20 65 82 2020
18 Kasungu District Hospital 19393 21785 19 40 62 2020
19 Chamwabvi Health Centre 1128 5304 9 28 53 2020
20 Linyangwa Health Centre 4380 2604 5 20 42 2020
21 Mziza Health Centre 5791 7131 18 29 45 2020
22 Kawamba Health Centre 7073 3771 19 42 64 2020
23 Kamboni Health Centre 4665 3028 8 28 43 2020
24 Khola Health Centre 3426 3453 4 13 21 2020
25 Santhe Health Centre 6556 6667 5 17 31 2020
26 Mkhota Health Centre 4592 3526 10 30 55 2020
# Model fitting ----------------------------------------------------------------
multivariate_1km <- glm(observed_cases~1+prop_pop_1km+year+offset(log(expected_cases)),
                        data = model_data, family = 'poisson')

summary(multivariate_1km)
## 
## Call:
## glm(formula = observed_cases ~ 1 + prop_pop_1km + year + offset(log(expected_cases)), 
##     family = "poisson", data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -67.623  -17.653    2.957   14.198   47.020  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.1134752  0.0046548  -24.38   <2e-16 ***
## prop_pop_1km  0.0145494  0.0002322   62.67   <2e-16 ***
## year2018     -0.0660683  0.0055799  -11.84   <2e-16 ***
## year2019     -0.1558530  0.0059832  -26.05   <2e-16 ***
## year2020     -0.0525871  0.0051717  -10.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 58585  on 95  degrees of freedom
## Residual deviance: 54663  on 91  degrees of freedom
##   (8 observations deleted due to missingness)
## AIC: 55593
## 
## Number of Fisher Scoring iterations: 4
# Build regression model table
regression_table_1km <- gtsummary::tbl_regression(
  multivariate_1km, exponentiate = FALSE) |> 
  gtsummary::bold_p()

sjPlot::tab_model(multivariate_1km, show.r2 = FALSE, show.aic = TRUE)
  observed_cases
Predictors Incidence Rate Ratios CI p
(Intercept) 0.89 0.88 – 0.90 <0.001
prop_pop_1km 1.01 1.01 – 1.02 <0.001
year [2018] 0.94 0.93 – 0.95 <0.001
year [2019] 0.86 0.85 – 0.87 <0.001
year [2020] 0.95 0.94 – 0.96 <0.001
Observations 96
AIC 55593.162
multivariate_2km <- glm(observed_cases~1+prop_pop_2km+year+offset(log(expected_cases)),
                        data = model_data, family = 'poisson')

summary(multivariate_2km)
## 
## Call:
## glm(formula = observed_cases ~ 1 + prop_pop_2km + year + offset(log(expected_cases)), 
##     family = "poisson", data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -68.727  -17.922    2.973   15.854   45.953  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.1381816  0.0048236  -28.65   <2e-16 ***
## prop_pop_2km  0.0064459  0.0001094   58.94   <2e-16 ***
## year2018     -0.0730332  0.0055760  -13.10   <2e-16 ***
## year2019     -0.1569054  0.0060366  -25.99   <2e-16 ***
## year2020     -0.0587958  0.0050935  -11.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 59818  on 97  degrees of freedom
## Residual deviance: 56275  on 93  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 57225
## 
## Number of Fisher Scoring iterations: 4
regression_table_2km <- gtsummary::tbl_regression(
  multivariate_2km)|> 
  gtsummary::bold_p()

sjPlot::tab_model(multivariate_2km, show.r2 = FALSE, show.aic = TRUE)
  observed_cases
Predictors Incidence Rate Ratios CI p
(Intercept) 0.87 0.86 – 0.88 <0.001
prop_pop_2km 1.01 1.01 – 1.01 <0.001
year [2018] 0.93 0.92 – 0.94 <0.001
year [2019] 0.85 0.84 – 0.86 <0.001
year [2020] 0.94 0.93 – 0.95 <0.001
Observations 98
AIC 57224.629
multivariate_3km <- glm(observed_cases~1+prop_pop_3km+year+offset(log(expected_cases)),
                        data = model_data, family = 'poisson')

summary(multivariate_3km)
## 
## Call:
## glm(formula = observed_cases ~ 1 + prop_pop_3km + year + offset(log(expected_cases)), 
##     family = "poisson", data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -70.940  -17.567    1.537   16.566   45.523  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.848e-01  5.347e-03  -34.57   <2e-16 ***
## prop_pop_3km  5.034e-03  8.687e-05   57.95   <2e-16 ***
## year2018     -7.087e-02  5.567e-03  -12.73   <2e-16 ***
## year2019     -1.624e-01  6.084e-03  -26.69   <2e-16 ***
## year2020     -6.158e-02  5.098e-03  -12.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 59818  on 97  degrees of freedom
## Residual deviance: 56333  on 93  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 57282
## 
## Number of Fisher Scoring iterations: 4
regression_table_3km <- gtsummary::tbl_regression(
  multivariate_3km, exponentiate = FALSE) |> 
  gtsummary::bold_p()

sjPlot::tab_model(multivariate_3km, show.r2 = FALSE, show.aic = TRUE)
  observed_cases
Predictors Incidence Rate Ratios CI p
(Intercept) 0.83 0.82 – 0.84 <0.001
prop_pop_3km 1.01 1.00 – 1.01 <0.001
year [2018] 0.93 0.92 – 0.94 <0.001
year [2019] 0.85 0.84 – 0.86 <0.001
year [2020] 0.94 0.93 – 0.95 <0.001
Observations 98
AIC 57282.361
# merge regression model tables 
table_merge <-gtsummary::tbl_merge(
    tbls = list(regression_table_1km,
                regression_table_2km,
                regression_table_3km),
    tab_spanner = c("**Model 1km**", 
                    "**Model 2km**", 
                    "**Model 3km**"))

table_merge
Characteristic Model 1km Model 2km Model 3km
log(IRR)1 95% CI1 p-value log(IRR)1 95% CI1 p-value log(IRR)1 95% CI1 p-value
prop_pop_1km 0.01 0.01, 0.02 <0.001
year
2017
2018 -0.07 -0.08, -0.06 <0.001 -0.07 -0.08, -0.06 <0.001 -0.07 -0.08, -0.06 <0.001
2019 -0.16 -0.17, -0.14 <0.001 -0.16 -0.17, -0.15 <0.001 -0.16 -0.17, -0.15 <0.001
2020 -0.05 -0.06, -0.04 <0.001 -0.06 -0.07, -0.05 <0.001 -0.06 -0.07, -0.05 <0.001
prop_pop_2km 0.01 0.01, 0.01 <0.001
prop_pop_3km 0.01 0.00, 0.01 <0.001

1 IRR = Incidence Rate Ratio, CI = Confidence Interval

# Check model perfomance -------------------------------------------------------

performance::check_model(multivariate_1km, theme = "see::theme_modern")

performance::check_model(multivariate_2km, theme = "see::theme_modern")

performance::check_model(multivariate_3km, theme = "see::theme_modern")

performance::compare_performance(
  multivariate_1km, multivariate_2km, multivariate_3km,
  metrics = c("AIC", "BIC", "RMSE", "Sigma"), rank = TRUE)
## # Comparison of Model Performance Indices
## 
## Name             | Model |       AIC |       BIC |     RMSE |  Sigma | Performance-Score
## ----------------------------------------------------------------------------------------
## multivariate_1km |   glm | 55593.162 | 55605.984 | 1382.086 | 24.509 |            75.00%
## multivariate_2km |   glm | 57224.629 | 57237.554 | 1354.368 | 24.599 |            29.78%
## multivariate_3km |   glm | 57282.361 | 57295.286 | 1355.655 | 24.612 |            23.84%

13.1 Estimate prediction error

Since our dataset is small and partitioning it will lead to higher bias. Therefore, we use the Leave one out cross validation (LOOCV) method which uses all data points to measure performance of the Poisson model. This method works as follow:

  1. Leave out one data point and build the model on the rest of the data set
  2. Test the model against the data point that is left out at step 1 and record the test error associated with the prediction
  3. Repeat the process for all data points
  4. Compute the overall prediction error by taking the average of all these test error estimates recorded at step 2.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.

# Define training control
train_control <-  caret::trainControl(method = "LOOCV")

# Removing missing values
model_data_evaluation <- model_data[complete.cases(model_data),] 



# Train the model, fit a regression model and use LOOCV to evaluate perfomance
train_model_1km <- caret::train(observed_cases~1+prop_pop_1km+year+offset(log(expected_cases)),
                                data = model_data_evaluation, method = "glm", trControl = train_control)

train_model_2km <- caret::train(observed_cases~1+prop_pop_2km+year+offset(log(expected_cases)),
                                data = model_data_evaluation, method = "glm", trControl = train_control)


train_model_3km <- caret::train(observed_cases~1+prop_pop_3km+year+offset(log(expected_cases)),
                                data = model_data_evaluation, method = "glm", trControl = train_control)

# Summarize the results
print(train_model_1km)
## Generalized Linear Model 
## 
## 96 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 95, 95, 95, 95, 95, 95, ... 
## Resampling results:
## 
##   RMSE     Rsquared    MAE     
##   2902.38  0.07939932  1824.458
summary(train_model_1km)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4027.0  -1558.4   -511.0    609.8  13603.9  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   1674.18     652.17   2.567   0.0119 * 
## prop_pop_1km   119.83      36.83   3.254   0.0016 **
## year2018      -101.95     819.08  -0.124   0.9012   
## year2019      -898.45     861.27  -1.043   0.2996   
## year2020      1838.26     839.10   2.191   0.0310 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 7892781)
## 
##     Null deviance: 861676767  on 95  degrees of freedom
## Residual deviance: 718243042  on 91  degrees of freedom
## AIC: 1803.9
## 
## Number of Fisher Scoring iterations: 2
print(train_model_2km)
## Generalized Linear Model 
## 
## 96 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 95, 95, 95, 95, 95, 95, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2945.181  0.0543699  1813.932
summary(train_model_2km)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4476.3  -1393.6   -533.1    695.6  14138.4  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   1639.26     699.94   2.342  0.02136 * 
## prop_pop_2km    44.15      16.71   2.642  0.00971 **
## year2018       -43.94     834.22  -0.053  0.95810   
## year2019      -707.49     874.67  -0.809  0.42070   
## year2020      1849.27     854.74   2.164  0.03312 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 8183301)
## 
##     Null deviance: 861676767  on 95  degrees of freedom
## Residual deviance: 744680433  on 91  degrees of freedom
## AIC: 1807.4
## 
## Number of Fisher Scoring iterations: 2
print(train_model_3km)
## Generalized Linear Model 
## 
## 96 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 95, 95, 95, 95, 95, 95, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE    
##   2944.342  0.05423955  1816.52
summary(train_model_3km)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4114.8  -1413.4   -560.7    464.7  14094.9  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   1289.93     778.78   1.656   0.1011   
## prop_pop_3km    34.93      13.28   2.631   0.0100 **
## year2018       -73.00     836.09  -0.087   0.9306   
## year2019      -716.15     876.50  -0.817   0.4160   
## year2020      1842.39     855.17   2.154   0.0338 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 8188035)
## 
##     Null deviance: 861676767  on 95  degrees of freedom
## Residual deviance: 745111188  on 91  degrees of freedom
## AIC: 1807.4
## 
## Number of Fisher Scoring iterations: 2
# The lower the Mean Absolute Error, the more closely a model can predict the actual observations

Model-based Geostatistics

# Function to split an scf_POINT object with single geometry column to 
# columns with separate longitude and latitude
separate.coordinates <- function(x, names = c("long","lat")) {
  stopifnot(inherits(x,"sf") && inherits(sf::st_geometry(x), "sfc_POINT"))
  ret <- do.call(rbind, sf::st_geometry(x))
  ret <- tibble::as_tibble(ret)
  stopifnot(length(names) == ncol(ret))
  ret <- setNames(ret, names)
  dplyr::bind_cols(x, ret)
}

catchment_points <- separate.coordinates(zipatala_aggregated_sf) |> 
  dplyr::select(rowID, long, lat)

 variogram_data <- merge(as.data.frame(catchment_points),
                         as.data.frame(model_data_2017), 
                         by = "rowID") |> 
  dplyr::select(long, lat, prop_pop_1km, prop_pop_2km, prop_pop_3km,
                observed_2017, expected_2017)
 
 # write.csv(variogram_data, "data/variogram_data.csv")

# Open the MBG app in R. Type the following into the R console:
 
# shiny::runGitHub(repo="MBGapp", username= "olatunjijohnson",
#                  ref="main", subdir = "inst/MBGapp")